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Abstract 

This paper introduces constNJ, the first algorithm for phylogenetic 
reconstruction of sets of trees with constrained pairwise rooted subtree- 
prune regraft (rSPR) distance. We are motivated by the problem of con- 
structing sets of trees which must fit into a recombination, hybridization, 
or similar network. Rather than first finding a set of trees which are op- 
timal according to a phylogenetic criterion (e.g. likelihood or parsimony) 
and then attempting to fit them into a network, constNJ estimates the 
trees while enforcing specified rSPR distance constraints. The primary 
input for constNJ is a collection of distance matrices derived from se- 
quence blocks which are assumed to have evolved in a tree-like manner, 
such as blocks of an alignment which do not contain any recombination 
breakpoints. The other input is a set of rSPR constraints for any set of 
pairs of trees. constNJ is consistent and a strict generalization of the 
neighbor-joining algorithm; it uses the new notion of "maximum agree- 
ment partitions" to assure that the resulting trees satisfy the given rSPR 
distance constraints. 

1 Introduction 

Since the pioneering paper of Sneath (1975), tens of thousands of papers have 
been published on the subject of "reticulate evolution." "Reticulate evolution" 
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has generally come to mean evolution where genetic material for a new lin- 
eage may come from two or more sources, as in the case of recombination and 
hybridization. The Oxford English Dictionary (1989) defines "reticulated" to 
mean "constructed or arranged like a net; made or marked so as to resemble a 
net or network." Correspondingly rather than evolutionary history being rep- 
resentable as a tree, a network is more appropriate. A considerable amount of 
effort has gone into the phylogenetic reconstruction of these networks. 

Algorithms for phylogcnetics in the presence of reticulation have followed 
a curiously different path then the mainstream of phylogcnetics. As surveyed 
below, current algorithms fall into three types: first, there are algorithms which 
attempt to find the phylogenetic network displaying some fixed characteristics 
(such as splits in an alignment or some set of trees) which contain the mini- 
mum number of reticulation events. Secondly there are algorithms to construct 
"splits networks," which do an excellent job of representing conflicting signals in 
the data, but do not give an explicit evolutionary history. The third approach 
is to sample from the posterior distribution of a population-genetics model, 
such as the coalescent with recombination. None of these approaches furnish a 
practical solution for certain cases, such as HIV researchers who would like to 
reconstruct the evolutionary history of an alignment which includes recombi- 
nant sequences. Indeed, first fixing a set of characteristics and then minimizing 
the number of reticulation events ignores the balance between number of retic- 
ulation events and phylogenetic optimality, the splits network approach does 
not tell a complete evolutionary story, and population-genetic algorithms are 
are not yet sufficiently fast for DNA sequence datasets which have thousands 
of nucleotides. As there are no algorithms which are practical for doing phylo- 
genetic reconstruction in this setting, HIV researchers who wish to reconstruct 
evolutionary history typically proceed in one of two ways: they either treat the 
whole alignment as having a single tree-like history, which cannot possibly be 
correct, or they build trees on sub-alignments independently, which does not 
take into account the underlying network structure. These two extremes, of as- 
suming all trees have the same topology or allowing their topologies to differ in 
arbitrary ways, leave a substantial gap in the middle, where the correct balance 
of optimality and discord should be found. 

The goal of constNJ is to begin filling this gap in a manner analogous to 
classical phylogenetic inference algorithms. To do so, we make a different set of 
assumptions than has previously been done considering the input and desired 
output. Regarding the data, we assume that that the given alignment has been 
segmented into "alignment blocks", each of which can be described by a sin- 
gle tree. For example, in the case of recombination, the alignment blocks are 
the segments of an alignment which do not contain recombination breakpoints. 
(Note that for the purposes of this paper we will be using the word "recom- 
bination" in the general sense, including processes such as gene conversion.) 
Although the assumption that the data comes pre-segmented is a substantial 
one, we don't think that it is unreasonable. From a practical standpoint, some 
assumption needs to be made, as algorithms which attempt to find a correct 
segmentation of the data and a sequence of trees simultaneously have a difficult 
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Figure 1: An example "reticulate" network and the two trees that it contains. 
Those two trees are related by a single rooted subtree prune regraft (rSPR) 
move, whereby the middle subtree is cut off of the tree and reattached at another 
location. The node v will be called a reticulation node. 

time searching the complete space. Furthermore, sometimes a segmentation is 
clear, such as the distinct RNA strands of the influenza genome. Other times, 
such as for recombination, it is not so clear, but even in this more difficult case 
the inference of recombination breakpoints has seen significant progress in the 
last 10 years (reviewed below). We will also assume that an outgroup has been 
selected. Such a choice is crucial, as it establishes directionality for reticulation 
events. 

Regarding desired output, rather than actually building a single reticulate 
network, this paper will focus on building "correlated sets of trees," which dis- 
play the sorts of constraints found on trees which fit in a reticulate network. 
We are focused on building trees because each alignment block is correctly de- 
scribed by a single tree. However, this set of trees must fit into a network, 
which forces constraints on their topology. Specifically, the trees which sit in 
these networks must be related by rooted subtree-prune-regraft (rSPR) moves, 
whereby a rooted subtree is cut from the original tree and then re-attached in 
another location (Figure 1). We describe below how it is necessary for trees 
sitting in a reticulate network to be related by rSPR moves, though this is not 
a sufficient condition. 

For constNJ, we assume that the user can supply a series of constraints 
describing the number of rSPR moves allowable between pairs of alignment 
blocks. For example, if the alignment contains "pure" types and a single class 
of recombinants which are derived from a pair of types, then there should be 
two alignment blocks and the trees for those blocks should be related by one 
rSPR move as in Figure 1. The challenge, then, is to reconstruct a set of trees 
which satisfy the constraints and which together optimize some phylogenetically 
relevant criterion, such as likelihood, parsimony, or balanced minimum evolu- 
tion. Note that constNJ actually constructs a number of such sets of trees, in 
order to display the balance between optimality of the individual trees and the 
number of reticulation events needed to fit the trees together into a network. 

We now present a motivating example. The CRF14_BG circulating recom- 
binant form (CRF) of HIV is known to be a mosaic of subtype B and subtype 
G viruses, and the breakpoints of the recombination events are known (Thom- 
son et at, 2001). We will call the region of the alignment where BG derives 
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Figure 2: Phylogenetic trees of the pure subtypes of HIV and the BG recombi- 
nant clade constructed independently using the no-recombination blocks of the 
HIV genome. The single letters (e.g. A,B,C. . . ) label clades of subtypes, and 
BG denotes a clade of circulating recombinant forms (CRFs) made from B and 
G subtypes. Tree (a) is built from the "G region," i.e. the region where the 
BG CRF derives from the G subtype, and tree (b) is built from the "B region," 
where BG derives from the B subtype. As noted in the text, although these 
trees do place the recombinant strains in the correct locations, they differ in a 
number of important ways which are not explained by recombination events. It 
is the perspective of this paper that these extra differences represent phyloge- 
netic error, and that accuracy can be improved by constraining the trees to fit 
into a recombination network. 



from the G subtype the "G region," and the region where BG derives from 
the B subtype the "B region." As in Thomson et al. (2001) and all similar 
papers we could find in the area, researchers build trees independently on the 
no- recombination blocks. We have repeated such an analysis in Figure 2, build- 
ing PHYML maximum likelihood phylogenetic trees using the F84 model and 
rooted using CPZ.CD.90.ANT.U42720 (removed from tree for clarity). As one 
would hope, the trees do indeed show that the BG CRF derives one portion of 
its RNA from the G subtype, and the other part from the B subtype. However, 
there are many more differences between the two trees than should occur for an 
alignment with a single recombinant strain. For example, the rooting changes 
between the two trees, as does the location of the C and the F-K clades. Build- 
ing a recombination network out of these trees would lead a number of spurious 
hypothesized recombinations. 

In contrast, for this dataset constNJ returns a collection of pairs of trees 
displaying the balance between the number of allowed rSPR moves between 
pairs of trees and phylogenetic optimality. This balance is described in the text 
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output of constNJ, which is shown in Table 1 for the BG dataset. The first 
column shows the rSPR distance between the two reconstructed trees (in this 
case the G region tree and the B region tree). As described below, the notion of 
optimality for constNJ is total tree length, which is a trivial generalization of 
the balanced minimum evolution (BME) criterion (Desper and Gascuel, 2004). 
It is displayed in the second column for the pairs of trees. Thus the second line 
states that constNJ found a pair of trees which differ by a single rSPR move, 
and which have total tree length about 7.119. The third column just shows 
the difference between the second column values between rows. Thus 0.0942 
signifies that there is a decrease of magnitude 0.0942 in total tree length by 
allowing a single rSPR difference between the two trees. 

In this way we can achieve an understanding of the balance between phylo- 
genctic optimality and number of recombination events. For example, we can 
see that the decrease in allowing a single recombination event is significantly 
greater in magnitude than that for allowing two rather than one. And sur- 
prisingly, allowing nine rSPR moves does not significantly decrease the total 
tree length compared to allowing three. Because the improvement in total tree 
length when allowing one rSPR move is significantly greater than that for any 
subsequent rSPR moves, we believe that Table 1 suggests that the data prob- 
ably arose from one recombination event, which agrees with the established 
knowledge concerning these taxa. 



rSPR distance 


total tree length 


tree length difference 





7.213 


0.0942 


1 


7.119 


0.0076 


2 


7.111 


0.0149 


3 


7.096 


0.0139 


9 (independent NJ) 


7.082 





Table 1: The balance between discord and optimality for the example HIV 
dataset. On the left side is the number of SPR moves required to go from the 
tree built on the G region to the tree built on the B region. In the center is the 
total tree length (see Equation 2), which is our notion of optimality. On the 
right is the difference of the total tree length between the rows. As described 
in the text, the largest drop in total tree length comes when allowing a single 
rSPR move (i.e. recombination event) between the two trees, indicating that 
one recombination event is needed to explain the data. 

Furthermore, the trees which constNJ finds assuming a single recombination 
event agree with the accepted recombination history of the BG recombinant 
circulating form (Figure 3). In particular, the only difference between them 
is the location of the BG clade, which switches from the G to the B subclade 
depending on the region analyzed. Importantly, these two trees can fit into a 
recombination network with a single reticulation node, in contrast to those in 
Figure 2. 

We believe that constNJ is the first algorithm of its kind, but will now review 
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Figure 3: One pair of phylogcnetic trees constructed for the same dataset using 
constNJ. In contrast to Figure 2, the only difference between the two trees is 
the location of the BG recombinant clade. These two trees fit into a recombi- 
nation network with a single recombination event, as should be the case for a 
tree for the pure subtypes with a single recombinant strain like we have here. 
constNJ correctly identifies that the BG subtype is a recombinant of the B and 
G subtypes. 



literature on related topics, starting with common terminology. The currently 
accepted term for the class of networks including both hybridization and recom- 
bination networks is "reticulate network". 1 If we consider a rooted tree to be 
a directed graph such that edges are directed away from the root, a reticulate 
network is a rooted phylogcnetic tree with additional directed edges making 
a directed acyclic graph with "tree nodes" of in-degree one and "reticulation 
nodes" of in-degree two (Huson et al., 2005). 

A considerable amount of work has gone into the problem of constructing 
a reticulate network given a set of phylogenetic trees which it must contain. 
This problem was initiated by Maddison (1997) and considerable progress has 
been made by Baroni et al. (2005), Nakhleh et al. (2005), Huson et al. (2005), 
and Bordewich and Scmple (2007a, b). As described above, we differ from these 
approaches as we would like to estimate the trees while ensuring that they fit 
into a reticulate network. 

A related problem (which was the original motivation for fitting trees into 
a network) is to reconcile distinct gene trees into a single species tree. This 
problem has received an appropriately large amount of attention, and has found 
a more realistic model-based formulation in Ane et al. (2007) and Edwards et al. 
(2007). These differ from the present paper because they assume that there is 
a single species tree, and that "correctness" of a gene tree should in part be 

1 This terminology is redundant, as the word "reticulate" already means network-like. 
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judged by the degree to which it fits within a species tree due to a coalescent 
model. In our setting, however, there is no single species tree, and the coalescent 
model may not be appropriate. 

Sometimes a related assumption is made, which is not that complete species 
trees are known, but that the resulting recombination network must display 
a specified collection of bipartitions, which are typically called splits. This is 
equivalent to assuming a supplied alignment evolves according to the infinite 
sites model of mutation. The problem again is to find a network which mini- 
mizes the number recombination events. This problem was first formulated by 
Hudson and Kaplan (1985), and was shown to be NP-hard in Wang et al. (2001). 
Progress was made in a sub-case by Gusfield et al. (2004) and a simpler related 
(and in some ways more realistic) problem was solved by Song and Hein (2005). 
In Huson and Klocppcr (2005), the authors note that the algorithm in Huson 
et al. (2005) can be extended to this case. Although a different formulation, this 
splits/infinite sites approach represents a different version of the same strategy: 
find the network displaying a certain set of characteristics which minimizes the 
number of reticulation events. 

Splits network methods are a biologically useful and mathematically inter- 
esting way of understanding conflicting signals in phylogenetic data. The first 
method to construct splits networks from distance data was the split decompo- 
sition approach of Bandelt and Dress (1992a, b). Another successful approach 
has been the "neighbor-net" algorithm created by Bryant and Moulton (2004) 
and further analyzed by Levy and Pachtcr (2008). These methods form a useful 
complement to phylogenetic analysis in the traditional tree-based sense, but do 
not reconstruct an explicit evolutionary history. We also note that recombina- 
tion networks need not be circular split systems, which are the sorts of splits 
networks returned by neighbor- net. 

On the other end of the spectrum lie likelihood-based methods using the 
coalescent with recombination (Hudson, 1983). Major recent advances have 
been made in this area. The full likelihood is quite daunting to compute, but 
Lyngs0 et al. (2008) have a parsimony-based approach which saves on compu- 
tation by several orders of magnitude. Importance sampling (Griffiths et al., 
2008) is also promising, but is not yet efficient enough for the long alignments 
typically encountered in phylogcnetics. Also, it is the intent of this paper to 
construct a method which is independent of population genetics models such as 
the coalescent. 

A related though distinct line of research is the inference of recombination 
breakpoints. One of most basic and most commonly used methods for the infer- 
ence of recombination breakpoints is called "bootscanning" , whereby a window 
is scanned along the alignment and a phylogenetic tree is built for each posi- 
tion of the window; a change in topology between sections of the window can 
be interpreted as evidence for a recombination breakpoint (Lole et al, 1999). 
There are many different variations on this theme. One promising line of re- 
search by Marc Suchard and collaborators apply multiple change-point models 
and reversible-jump MCMC to estimate trees and model parameters along the 
alignment (Suchard et al., 2003; Minin et al., 2005). We also note that some- 
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times recombination breakpoints can be seen "with the naked eye" as in Thom- 
son et al. (2001). In contrast to our paper, it is not the intent of these methods 
to accurately infer phylogeny; furthermore they do not posit any relationship 
between trees in neighboring no-recombination blocks. Furthermore, some of 
the more computationally intensive methods actually require a fixed reference 
tree. 

In summary, we are not aware of any available method for building reticulate 
networks which gives an explicit rooted phylogcnctic history for each column of 
the alignment, which elucidates the balance of discord between the trees and 
optimality for those trees, and which is efficient enough to be useful for mod- 
ern data sets. The lack of practical phylogenetic algorithms in the presence 
of recombination was recently demonstrated in a simulation study by Woolley 
et al. (2008). Huggins and Yoshida (2008) have noted the lack of useful recon- 
struction algorithms for host-parasite relationships and have noted the need for 
an algorithm which balances tree concordance and optimality as constNJ does. 
Although far from a complete solution for these cases, we believe that constNJ 
is a first step in the right direction. 

2 General description of constNJ 

The primary input for constNJ is a collection of alignment blocks, which as 
described are disjoint subsets of columns of the alignment which are assumed 
to evolve in a tree-like manner. In the case of alignments with recombinant 
sequences, the alignment blocks are simply the no-recombination blocks. Note 
that the alignment blocks need not be contiguous; for example a single recom- 
bination event with two recombination breakpoints will result in two, not three, 
alignment blocks. The other input for constNJ is a sequence of constraints 
on the rSPR distance between the trees constructed for the no-recombination 
blocks as described below. Given this input, the goal of constNJ is to exhibit 
the balance between discordance among the alignment-block trees on one hand, 
and optimality of the trees in some phylogenetic sense on the other. 

constNJ is a deterministic distance-based approach to reconstruction; we 
chose this direction for several reasons. First, the underlying space for a likeli- 
hood optimization scheme is even larger than usual, making a heuristic search 
even less appealing: there are [(2n — 3)\l] k fc-tuples of rooted bifurcating phy- 
logenetic trees on n taxa. The sorts of constraints we will be imposing reduces 
this number substantially, but little is known about the resulting graph under 
the sorts of moves typically used in heuristic phylogenetic searches. Further- 
more, likelihood-based approaches are substantially improved by starting with a 
reasonable tree, which in modern applications is typically a distance-based tree. 
Thus, even if a likelihood-based approach was the eventual goal, a distance- 
based approach would be useful as a "seed" for the heuristic likelihood search. 
Finally, we feel that distance- and likelihood-based algorithms occupy distinct 
and complementary roles in the world of computational phylogenetics. 

Our goal is to design an approach which generalizes the remarkably accurate 
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and hugely popular neighbor-joining algorithm (Saitou and Nci, 1987). Remark- 
ably, it took almost 20 years for the phylogcnctics community to learn the objec- 
tive function of neighbor-joining; during that time it was even claimed that no 
such objective function existed. However, it is now known that neighbor-joining 
greedily optimizes the "tree length" £(T, D) (defined below in Equation 1) for 
the given distance matrix D. constNJ generalizes this objective function, as it 
attempts to minimize the total length of all of all k trees (2) by a combination 
of greedy steps. 

The trees resulting from constNJ are constrained by the user to be some 
specified number of rooted subtree-prune-regraft (rSPR) moves from one to an- 
other. As displayed in Figure 1, reticulation events such as recombination and 
hybridization correspond to rSPR tree rearrangements. The converse is not true: 
arbitrary rSPR tree rearrangement events need not correspond to reticulation 
events. For recombination or hybridization to take place, the participants in the 
event need to exist at the same time; it is not hard to set up examples of rSPR 
move combinations which violate this fact (see, e.g., Song and Hcin, 2005) . 
Methods have been developed which take timing restrictions into account (Song 
and Hein, 2005; Bordewich and Semple, 2007a), but we do not incorporate these 
ideas into a phylogenetic reconstruction framework. This may be an interesting 
avenue to for future research, but on the other hand seeing such timing viola- 
tions can actually be informative. First, there may be something wrong with 
the data. Second, it has been noted (Baroni et ai, 2006) that reticulation net- 
works can appear to violate timing constraints if certain taxa are not sampled. 
The problem of determining the minimal number of "missing" taxa required to 
explain timing constraints has been analyzed by Linz et al. (2008). Therefore 
we have left interpretation of timing issues up to the user of the program. 

We now make a more formal statement of the problem constNJ attempts to 
solve; note that a similar formulation was made independently by Huggins and 
Yoshida (2008) in the context of host-parasite relationships. 

Problem 1 (rSPR-constrained balanced minimum evolution). Given k n x n 
distance matrices D\, . . . , D k and a symmetric kxk constraint matrix C, find the 
set of trees 7\, . . . , T k minimizing Y^i=i ^(^i> £>i) such that G? r sPR(Ti> Tj) < Cij 
for each i and j . 

Theorem 2. constNJ is a consistent algorithm to solve Problem 1. 

For constNJ we proceed in a manner analogous to that for neighbor-joining. 
The neighbor-joining algorithm starts with all taxa connected to a central node, 
then at every stage, chooses the "coalescence" (in other papers, "amalgamation) 
of trees which most decreases the value of the total tree length. We mimic this 
philosophy by evaluating coalescences based on how they affect the total tree 
length. However, in the end we must come up with a collection of trees T\ , . . . , T k 
that satisfy the prescribed rSPR constraints. This raises the question of how one 
might bound the rSPR distance of the eventual trees "ahead of time," i.e. before 
the termination of the coalescence steps. For instance, if in the developing trees 
one has the subtrees (a, b) for the first distance matrix, and (a, c) for the second 
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Figure 4: Schematic diagram of the constNJ algorithm. As described in the 
text, at every stage we attempt to find the optimal pair of partially coalesced 
trees which could eventually be at most some fixed number of rSPR moves 
apart. As shown in Theorem 19, the m value for a pair of partially coalesced 
trees forms a sharp lower bound for the eventual rSPR distance between those 
trees. Therefore pairs of partially coalesced trees which have m value exceeding 
the constraint on rSPR distance can be thrown out, as shown by the X. 

distance matrix, it is clear that the resulting trees must have rSPR distance at 
least one between trees Ti and T 2 . 

The question of how to bound eventual rSPR distance is solved by Theo- 
rem 19. Specifically, we generalize m, the size of the maximum agreement forest 
(Bordewich and Semple, 2005) to these partially coalesced trees, which forms 
a sharp bound. In short, the m value for a pair of partially coalesced trees 
T and S is the minimum rSPR distance possible among trees resulting from 
coalescences of T and S; thus once a pair of partially coalesced trees achieves 
an m value above the corresponding constraint, we can throw that pair out, as 
the eventual resolved trees will never satisfy the constraints. 

Using this m we construct our greedy algorithm, as shown in Figure 4. Say 
that we only have two trees, and that we want to find the minimal-total-length 
pair of trees which are only one rSPR move apart. At every stage, we attempt 
to find the best pair of partially coalesced trees with m values zero and one. 
We start with two star trees; m applied to this pair is zero. The first-step 
coalescence must also lead to a pair of trees which have m value zero, as one 
of the trees is still completely unresolved. Say the optimal, in terms of total 
tree length, second step NJ-type coalescence leads to a pair of trees which have 
m value one (indicated by the first diagonal arrow in Figure 4). Then we go 
down the list of second-step coalescences for the trees, and find the best one 
which does not increase the m value at all (indicated by a horizontal arrow in 
Figure 4). Next we repeat the process for each of the trees from the previous 
stage, saving the best pair of trees which have m values zero and one. In the 
end, we will have the best pair of trees which have rSPR distances zero and one 
which were achievable via a series of greedy steps. Although not guaranteed to 



10 



be the optimal pair of trees, the algorithm is consistent. 

3 Technical preliminaries 

In this section we review some definitions and clarify our notion of optimality. 
As stated in the introduction, we will always assume that an outgroup taxon has 
been chosen, and will label it p. Thus we always assume that p is contained 
in any taxon set A. We will use the following definitions. For the purpose of 
this paper, a tree on a finite taxon set A will be a rooted binary phylogenetic 
X-tree. A forest on taxon set X will be a collection of trees on disjoint taxon 
sets such that the union of the taxon sets is X. We will sometimes consider a 
tree on A to be a forest with a single tree. An unrooted tree on a finite taxon 
set X will be an unrooted phylogenetic A-tree (note that unrooted trees will be 
allowed to have multifurcating nodes.) C(R), £(R), and V(R) will denote the 
leaves, edges, and vertices of a tree, unrooted tree, or forest R. 

Although p represents the true rooting of the phylogenetic tree, we will not 
always assume that our trees or forests are rooted at p. We must do so 
because the NJ-type coalescences will not in general root the tree at the edge 
leading to p. Therefore, we must allow alternative roofings, but at the same 
time keep in mind that the rSPR distance between the trees must be calculated 
with respect to the edge leading to p. Thus we use the following definition of 
rSPR on an unrooted tree: given an unrooted tree U on a taxon set A 3 p, a 
single SPR move first cuts some edge of the tree except for that leading to p, 
resulting two rooted trees R and S. Say p G C(R). Suppress the degree two 
root node of R, and attach S to some edge of the resulting unrooted tree by 
inserting a degree two node onto the chosen edge, then connecting the root of S 
to that new node. This definition is the same as that of Bordewich and Scmplc 
(2005) when considering trees rooted at the edge leading to p. 

As with any distance function defined implicitly in terms of a graph, the 
minimum number of rSPR moves required to transform one tree T into another 
5* is a metric; we define d r $Y>-R.{T, S) to be this number. 

3.1 Tree length and the balanced minimum evolution cri- 
terion 

As reviewed by Gascuel and Steel (2006), phylogenetics researchers now under- 
stand the optimality function of the neighbor-joining algorithm (Saitou and Nei, 
1987). Let p(i, j) denote the path from i to j in the unrooted tree T, and define 
the weight of a path from leaf i to leaf j as 

w{i,j)= n - \ -• 

11 deg (v) - 1 
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Then the "length" of an n taxon tree T with respect to an n x n distance matrix 
D is (Semple and Steel, 2004): 

£(T,D) = ^2w(i,j)D iij . (1) 

The name "tree length" comes from the fact that if D is a distance matrix 
derived from some assignment of branch lengths to the edges of T, then I will 
be the total length of all of the edges. However, the name may be somewhat 
confusing initially, because I need not be defined as sum of the branch lengths 
of any specific tree. 

The tree T which minimizes £(T, D) for some distance matrix D is known 
as the balanced minimum evolution (BME) tree for the distance matrix D. The 
BME criterion is consistent (Desper and Gascuel, 2004), and neighbor-joining 
is a consistent trec-building heuristic which greedily minimizes total tree length 
(Desper and Gascuel, 2005) As described in Problem 1, constNJ attempts to 
minimize 

k 

(2) 

t=l 

while enforcing pairwise constraints on the rSPR distance between pairs of trees. 
When k = 1 constNJ is simply neighbor joining, while for k > 1 constNJ is a 
strict generalization of NJ. 

4 Rooted SPR and maximum agreement parti- 
tions 

This section describes the primary technical content of this paper. As described 
in the introduction, we would like to proceed via coalescences in a manner similar 
to neighbor-joining, while ensuring that the eventual rSPR distance between 
the trees is not too large. In order to assure adherence to the rSPR criterion, 
we develop the notion of maximum-agreement partition, which generalizes the 
notion of maximum agreement forest from Bordcwich and Semple (2005). As 
shown in Theorem 19, maximum agreement partitions and the associated m 
value allow us to bound the rSPR distance between the two partially resolved 
trees "in advance." 

4.1 Compatibility and coalescence 

We will use the following definitions. A split on a taxon set X is a bipartition of 
X. Because the set X will be clear, we will often abuse notation by identifying 
A C X with the partition A\(X \ A). Furthermore, because we have a special 
element p, we can distinguish between the two sides of a split; the side not 
containing p will be called the rsplit (short for rooted split) of the split. It is 
clearly equivalent to describe a given partition in terms of a split or an rsplit, 
and we will use the two descriptions interchangeably. 
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Note that the neighbor-joining algorithm is typically thought of as proceed- 
ing by coalescing internal nodes of an unresolved phylogenetic tree (see Figure 4); 
however for our purposes it will sometimes be easier to consider the forest ob- 
tained by deleting the central node and the associated edges. The opposite 
construction will be called "starification." 

Definition 3. Given a forest F, define the starification ~k(F) of F as the 

following unrooted tree. If F has one tree, then suppress the degree two root 
node of F. If F has two trees, then join their root nodes by an edge. If F has 
three or more trees, join all of the root nodes of trees of F to a single node. The 
new introduced node will be called the star node. 

We will identify any one, two, or three tree forest F with its starification, in 
which case there is no designated star node. 

Definition 4. Given a tree T which is part of a forest F on a taxon set X , 
define the edge splits Eb(T) to be C(T)\[C(T)] C along with the set of splits on 
X induced by the edges ofT. We define Sb(F) to be the union of the edge splits 
ofT across all trees T in F. 

For example, the rsplits {3} and {2,3,4} are both edge rsplits of the forest 
((l,p),2);(3,4). 

Definition 5. Given a forest F on a taxon set X , A is a separating split of F 
if A is the union of taxon sets for a collection of at least two trees of F. The 
set of separating splits of F will be denoted T,s(F). 

Given a forest F we will write E(F) for T. E (F) U S S (F). This will be the set of 
splits used to make agreement partitions as described below. 

Definition 6. Two rsplits A and B will be called compatible if either A(~]B = 0, 
A C B, or B C A. 

Because A and B are the sides of the splits which do not contain p, this is 
the same as the usual criterion for split compatibility (Semple and Steel, 2003). 
Therefore we have the following well known theorem. 

Theorem 7 (Buneman, 1971). A collection of splits M on a taxon set X is 
pairwise compatible iff there exists an unrooted tree T on taxa X such that M 
is a subset o/Sb(T). There is a one-to-one correspondence between compatible 
sets of splits on X and minimally-resolved unrooted trees on X. 

Definition 8. Two forests F and G on taxon set X are compatible ifJ^E{F) 
andY>E(G) are pairwise compatible. 

Definition 9. The join T A S of two trees T and S on disjoint taxon sets is 
the tree obtained by joining the root nodes ofT and S to a new root node. The 
coalescence ofT and S in the forest F is the forest {T A S} U (F \ {T, S})}. 

Note that the operation of coalescence gives a partial order on the set of 
forests on a given taxon set. Namely, we write F > G if F is a coalescence of 
G. Clearly, trees are the maximal elements in this partial order. 
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Definition 10. A tree S is a subtree of an unrooted tree U if S is one compo- 
nent of the disconnected graph obtained by cutting an edge of U. A tree S is a 
subtree of a rooted tree T if S is a component of the disconnected graph obtained 
by cutting an edge of T , and S does not include the root of T . 

We emphasize that the subtree definition is different than than that of an 
induced subtree, which is as follows. The existence of induced subtrees is guar- 
anteed by Theorem 7 or its rooted equivalent. 

Definition 11. Given a tree T and Y C C(T), T\y is the (rooted or unrooted) 
tree on taxa Y with rsplits {A n Y : A e £e(T)}. 

There is also an analogous definition for forests. 

Definition 12. Given a forest F and Y C C(F), F\ Y is 

{T\ Y :T eF and L(T) n Y + 0}. 

Proposition 13. If two forests F and G on a taxon set X are compatible, and 
F has more than one tree, then there exists H >- F such that H is compatible 
with G. 

Proof. If F has two or three trees, the proposition is trivial. Otherwise, let U 
be the tree with split set equal to the union of T,e{F) and Sb(G). If U is not 
resolved (i.e. if there exists an internal node of degree greater than three) then 
take an arbitrary resolution. As all of the trees T of F are resolved, each T sits 
as a subtree of U; let J be the union of the nodes of the T G F (considered as 
nodes of U). Let p denote the longest path in U which does not contact any of 
the nodes in J. Because F has at least four trees, p will be nontrivial. Pick one 
end of this path, which must be connected to a pair of trees 5", S" of F. Let 
K = C(S') U C(S"). As the split K \K C is already a split of U, we know that it 
is compatible with T,e(F) and S^(G), and thus that S(S" A 5"') is compatible 
with T, E {G). Let H be the coalescence of S' and 5"' in F. □ 

4.2 Maximum agreement partitions 

In this section we introduce the notion of maximum agreement partition (MAP), 
which generalizes the idea of maximum agreement forests. Maximum agreement 
forests were first introduced by Hein et al. (1996), and further refined by Bor- 
dewich and Scmple (2005). In broad terms, given two forests F and G on a 
taxon set X, we will be interested in considering partitions P which are obtain- 
able from F and G independently by "combining" edge splits and separating 
splits of those forests, in the same way that edge cuts are combined when mak- 
ing maximum agreement forests. The appropriate notion of "combining" splits 
is the infimum, which we now describe. 

The set of partitions on a given finite set Y form a partial order, such that 
a partition P\ < Pi if P\ is a refinement of P^. In fact, the set of partitions 
is a complete lattice, meaning that any set of partitions on Y has a supremum 
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and an infimum. For a collection of partitions M, we will use inf(M) to denote 
their infimum. 

Thus, as described below, a necessary condition for P to be an agreement 
partition for two forests F and G is that P can be expressed as inf(M) and 
inf(iV) for M C E(F) and N C S(G). It will now be useful to connect that 
definition to one in terms of convexity of characters (Semple and Steel, 2003). 

Definition 14. Given a partition P on some set K, define P\j for some J C K 
to be the partition [Y n J : Y G P}. 

The following is a slight generalization of the definition of convexity given 
by Semple and Steel (2003). 

Definition 15. A partition P on a taxon set X is convex on a forest F on X 
if there exists an H y F such that P induces a convex character on ~k(H), i.e. 
if there exists a partition P on vertices V(if(H)) such that 

(i) P = P\x- 

(ii) Any Y E P separates *k(H) into connected components. 

The following proposition relates the notions of "obtainable by a series of 
cuts along edge or separating splits" with the notion of character convexity. 

Proposition 16. A partition P of a taxon set X is convex on a forest F iff 
there exists M C E(F) such that P = inf(M). 

Proof. Assume M C E(F) such that P = inf(M). Note that K = inf(M n 
^s(F) is a set of disjoint separating or root-edge splits for F; thus we can 
perform coalescences, making H, such that the splits from sets in K are edge 
splits of H. Such an H will satisfy the criteria of the definition. 

For the converse implication, cutting any edge (u, v) of *k(H) for any H >z F 
gives a split in T,(F). We then define M as the set of such splits s UjV such that 
such that (u, v) is an edge and u and v are in distinct sets of the partition P. 
By construction, P = inf(M). □ 

The following definition generalizes the notion of agreement forest. 

Definition 17. We say that a partition P of taxon set X is an agreement 
partition for a pair of forests F,G on X if 

(i) for every pair of rsplits A £ T, E (F), B G E E (G), and Y G P, AC\Y is 
compatible with B PiY. 

(ii) P is convex on F and G. 

We say that P is a maximum agreement partition (MAP) if the number of sets 
of P is less than or equal to that of any other agreement partition. Let m(F, G) 
be the number of sets in the MAP minus one. 
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Note that by Theorem 7, for two resolved unrooted trees U, V on a taxon set 
X 3 p, the size of the maximum agreement partition is the same as the size of the 
maximum agreement forest of the trees (rooted at p) in the sense of Bordewich 
and Semple (2005). Recall that the definitions of maximum agreement forest in 
Bordewich and Semple (2005) differs from that of Hein et al. (1996) and Allen 
and Steel (2001). 

Proposition 18. Assume F, G, and H are forests on a taxon set X such that 
H y F, and P is an agreement partition for H and G. Then P is also an 
agreement partition for F and G. 

Proof. Part (i) of the definition is clear as T,e(F) C Y>e(H). Next we check 
(ii), i.e. that P is convex on F. Note that H y F implies S(-ff) C 12(F), as 
the "extra" edge splits of H will be separating splits of F. By Proposition 16, 
there exists an M C 12(H) such that P = inf(M); by the previous sentence 
M C 12(F) and so by Proposition 16 again P is convex on F. □ 

The following theorem is the main motivation for studying the maximum 
agreement partition. Thus the proposition says that the size of the maximum 
agreement partition of the two forests F and G is the same as the rSPR distance 
in the best case. 

Theorem 19. The minimum of d r spn(U, V) across all unrooted trees U >; F 
and V y G is equal to m{F 1 G). 

The proof of this proposition will come after two lemmas. 

Lemma 20. Given a partition P convex on F and Y G P such that F\y 
includes two distinct trees T and S, then there exist distinct trees T,S G F such 

= T A S. Furthermore, for any Z G P not equal to Y and any 



that 



(f AS*) 



R G {T, S}, we have either Z G C(R) or Z n C{R) = 0. 

Proof. Let H and P be as in Definition 15. Let Y G P be such that YC\C(F) = 
Y. Let f (resp. S G F) be the tree such that f\ Y = T (resp. S\ Y = S). Let 
Q = C(T)UC(S). 

We now show that T ^ S. The contrary would imply if(H)\Q = T\q. 
Because Q C Y and if(H)\ Y is connected by definition, T\q is connected so T 
and S would not be distinct. This is a contradiction. Thus £(T) (l£(S) = so 
(fAS) = f \ Y A S\ y = T A S. 

We now show the second statement of the lemma. Let r(W) denote the 
root node of any tree W G F. Note that r{T) and r{S) must be in Y because 
+ (H)\q is connected and the path between any a G C(T) and b G C(S) passes 
through r(f) and r(S). 

Now assume that for some R G {T, S} we have that some Z ^ Y of P 
intersects £(R) but is not contained in it. Take c G ZC\C{R) and d G ZC\[C(R)] C . 
Let Z G P be such that ZnL(F) = Z. By the same argument as in the previous 
paragraph, r(R) is in Z. This is a contradiction as Y and Z are disjoint. □ 
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Lemma 21. Assume that F and G are forests on a taxon set X , and P is an 
agreement partition for F and G. Then there exist resolved trees U >; F and 
V h G such that P is an agreement partition for U and V . 

Proof. It is enough to show that if one of the forests, say F, has at least four 
trees then there exists an H >- F such that P is an agreement partition for H 
and G. 

If for every Y £ P we have that F\y is a single tree, then we can make Hq by 
taking an arbitrary coalescence of F; any such coalescence will be "broken" by 
P and thus will not introduce any splits violating (i) of Definition 17. Thus we 
assume that F\ Y has at least two trees. By Proposition 13, there exist nontrivial 



Y is compatible with G\y 
f A §] = TAS. Let H 

Y 



T,S € F\ Y such that the coalescence of T and S in F 
By Lemma 20, there exist T and S in F such that 

be the coalescence of T and S in F; the second statement of Lemma 20 implies 
that the coalescence of T and S does not introduce any new edge splits when 
restricted any Z ^ Y in P, and so H satisfies the criterion (i) of a maximum 
agreement partition. 

Also, P is convex on H , establishing criterion (ii). Indeed, by Proposi- 
tion 16, let M C 12(F) be such that P = inf(M); we need to show that 
M C Y,(H ). The only difference between E(F) and E(-ff ) is that E(£T ) 
does not have separating partitions which separate T and S, but M cannot 
contain such a partition because T and S both have taxa in the same partition 
of P. □ 

Proof of Proposition 19. Lemma 21 shows that the minimum of m(U, V) is less 
than or equal to m(F, G). The other inequality follows from Proposition 18. 

Now note that for a resolved tree on X rooted at p, the notions of maximum 
agreement forest and maximum agreement partition coincide. Thus by Theo- 
rem 2.1 of Bordewich and Semple (2005), m(U, V) is equal to the rSPR distance 
between U and V for any resolved U >z F and V >z G. □ 



4.3 Calculating the maximum agreement partition 

As introduced above, and described more clearly below, constNJ needs to find a 
great number of agreement partitions. Indeed, a sample constNJ run with three 
distance matrices, 27 taxa, with pairwise constraints of size two required 5867 
calls to the subroutine finding the size of a MAP. Therefore a speedy calculation 
of the MAP is essential. 

In the present implementation of constNJ, the MAP is calculated is via a 
simple extension of the algorithm by Bordewich and Semple (2005). As with the 
usual Bordewich-Semple algorithm, we contract isomorphic subtrees and replace 
chains of pendant subtrees with chains of three pendant edges. However, we 
consider separating rsplits as well as edge rsplits to find the agreement partition. 

An alternative would be to consider an integer linear programming (ILP) 
approach to the MAP problem based the work of Yufeng Wu, who has re- 
cently developed an ILP approach to finding a maximum agreement forest (Wu, 



17 



2008). Although Wu's ILP approach is many orders of magnitude faster than 
the Bordcwich-Scmplc algorithm for finding the size of the maximum agreement 
forest in the "hard" case when two trees are quite different, our tests have shown 
that it is slower in the "easy" case. This difference is probably because there 
is overhead to creating the linear programming matrix, which does not scale 
strongly with respect to the difficulty of the problem, while the Bordewich- 
Semple algorithm is very fast for easy problems. It is possible that some of 
the ILP overhead could be amortized by clever re-use of portions of the matrix 
across coalescences, or a combination of Bordewich-Semple and Wu ideas, but 
we have not followed these directions. 

5 The constNJ algorithm 

Assume constNJ is given k distance matrices on a taxon set X. On the way 
to constructing our trees T\ , . . . , on X we will be constructing collections 
of forests F = F\, . . . , i^; we will call such a collection F an "instance." For 
example, each boxed pair of trees in Figure 4 is an instance (after deleting the 
central "star" nodes). The agreement profile for an instance F is the k x k 
matrix a(F) where a(F)ij is m(F i} Fj). It describes the degree to which the 
forests agree. The identical agreement profile is the k x k zero matrix. Define 
the instance tensor to be a partially filled tensor of instances indexed by N fc , 
where F is stored in the "slot" indexed by its agreement profile a(F). 

Algorithm 22 (constNJ). Given nxn distance matrices D\,...,Dk and a 

k x k constraint matrix C, 

1. Let F(°) be the trivial instance, i.e. F- ^ is the trivial forest on n taxa for 
each 1 < i < k. Let FL^ be the instance tensor containing only F^°K 

2. Repeat the following until termination: 

a. Let H be the instance tensor from the previous step. 

b. Rank all possible coalescences of all of the instances of H by how 
much they will decrease total tree length. 

c. Make a "step " by walking down this ranked list in order as follows: 

i. Perform the chosen coalescence, say of an instance F, and as- 
sume that the resulting instance F' has agreement profile X. 

ii. Lf some entry of X is greater than the corresponding element of 

C, discard F' and test the next coalescence. 
Hi. Lf not, and F' is the first in this step to have agreement profile X, 
then save it. If, on the other hand, another instance has already 
been found in this step with agreement profile X , then discard F' 
as it must have a larger total tree length. 

iv. Stop walking down the list if X is the identical agreement profile. 

d. Terminate if each of the Fi have three trees or fewer. 
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We now show that this algorithm is consistent. 

Proof of Theorem 2. In broad terms, Algorithm 22 is consistent because of the 
consistency of neighbor-joining (Gascuel, 1997; Bryant, 2005) and because the 
coalescence which most decreases the total tree length must be a neighbor- 
joining step (Desper and Gascuel, 2005). We are given a sequence of distance 
matrices D\, . . . , Df. and a symmetric k x k constraint matrix C. By hypothesis, 
these distance matrices come from a sequence of trees Ti, . . . , T k such that the 
rSPR distance between Tj and Tj is bounded above by Cij. First, by the consis- 
tency of neighbor-joining, NJ applied to each distance matrix independently will 
recover the correct collection of trees. Say the sequence of neighbor-joining coa- 
lescences making Tj gives a series of forests i*i,t, . . . , F n _i t i, where F n -\^ — Ti. 
Thus by Theorem 19 (more specifically, Proposition 18) and our assumptions 
about the Ti, 

m{F a ,i,F btj )<Cij (3) 

for any 1 < a,b < k and 1 < i,j < n. Thus the constraints will always be 
satisfied as long as we follow the sequence of NJ steps for each tree. 

Next we show by induction that given this data, at every step every constNJ 
forest will one of the F r j for 1 < r < n and 1 < j < n — 1. This is clearly true 
at initialization. By induction, assume the assertion is true at some constNJ 
step. Consider the coalescence which decreases total tree length as much as 
possible irrespective of constraints; say it occurs in Fkj. As the coalescence 
decreases the tree length of Fk,i compared to other coalescences of Fk,i, is also 
a neighbor-joining step for Fk,%, making Fk+i,i- By the previous paragraph, 
we know that this coalescence will preserve the constraints, and thus is also a 
constNJ step (recall that each constNJ step decreases the total tree length as 
much as possible amongst coalescences which preserve the constraints). Thus 
at the end we get F n _i^ for each i by induction. Because F n _i ti = T i} constNJ 
is a consistent algorithm. 

□ 

5.1 Implementation 

We have implemented constNJ in the fast functional/imperative language ocaml 
(Leroy et at, 2007). The implementation has a simple command line interface, 
which is documented in the accompanying manual. It is available for download 
from the author's website, at http://www.stat.berkeley.edu/-matsen/constNJ/ 

As described above, the primary input for constNJ is a series of distance 
matrices, with one for each alignment block. The program is designed to ac- 
cept distance matrices from the DNADIST program of the PHYLIP package, 
although longer lines and taxon names are allowed. The first taxon is assumed 
to be the outgroup. The program assumes that the taxa in the distance matrices 
are ordered in a corresponding way. For instance, if one is using constNJ to 
investigate recombination, all of the taxa should be listed in the same order, so 
that the taxa in the no-recombination blocks correspond to one another. On the 
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other hand, if one is using constNJ to investigate host-parasite relationships, 
the ith taxon in the parasite alignment should parasitize the ith taxon in the 
host alignment. If, for example, a given parasite is present in multiple hosts, 
this will require duplication of that parasite sequence in the alignment. 

The second input for constNJ is a set of constraints for the resulting corre- 
lated set of trees. There are two options for specifying these constraints: first, 
via a file, or second, by enforcing "linear" constraints. For example, assume we 
supply three distance matrices: Dq, D\, and £> 2 , and would like to construct 
trees T , T\, and T 2 . To specify constraints for these matrices, one writes one 
constraint per line, with first the indices of the distance matrices then the num- 
ber of rSPR moves allowed between those distance matrices. For example, a 
line saying 2 1 would mean that T and T 2 are constrained to be one rSPR 
move apart. On the other hand, one may specify a linear constraint with a 
linear constraint parameter. If the linear constraint parameter is L, then trees 
Ti and Tj are constrained to be L ■ \i — j\ rSPR moves apart. So if we apply a 
linear constraint with parameter 2 in our example, then both To and T\ and T\ 
and T 2 are constrained to be at most 2 rSPR moves apart, while T and T 2 are 
constrained to be at most 4 rSPR moves apart. 

The output for constNJ is collection of correlated sets of trees, each of which 
get their own . tre file, along with a . lengths file, which describes the total tree 
length for each of these sets of trees. constNJ returns at most one correlated set 
of trees for each agreement profile within the constraints, which is labeled by 
the agreement profile. If the constraints are given in a file, then the agreement 
profile is written in the order given in the file. If linear constraints are given, 
the agreement profile is written as a vector representing an upper triangular 
matrix in the usual way. For example, the agreement profile for three trees with 
linear constraints is written (d rS pR(T , Ti), c? r sPR(T , T 2 ), rfrSPR^i, T 2 )), so the 
set of trees in the file example . 2_1_1 .tre has agreement profile (2,1,1). The 
. lengths file contains the information on tree lengths, as in Table 1 of the 
introduction. Namely, for each correlated set of trees returned by constNJ, it 
displays the total tree length for those trees. 

5.2 Speed 

A rigorous worst-case runtime analysis of constNJ would show that it can be 
incredibly slow. Indeed, the maximum agreement partition is a generalization 
of the maximum agreement forest; thus finding the size of the MAP is NP- 
hard by the corresponding theorem by Bordewich and Scmple (2005). However, 
constNJ docs not just need to solve one such problem, it needs to solve quite 
a number of them. At worst, constNJ would need to find as many MAP's as 
there are possible coalescences, just for a single step and a single instance; if 
an instance had forests with £\, . . . , £k trees, then there will be (^) x • • • x ) 
possible coalescences, each of which in theory could require solving of a MAP 
problem. At any step there can be as many instances as there are agreement 
profiles satisfying the constraint matrices, and a problem with n taxa and k 
distance matrices will require nk such steps. Such an analysis would not give a 
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very clear understanding of the practical time requirements of running constNJ. 

In practice, constNJ can be used effectively for a moderate number of taxa 
and a small number of closely-constrained trees. The running time depends 
somewhat on the number of taxa, but quite a lot on the constraints and num- 
ber of distance matrices. Indeed, the main bottleneck is the MAP calculation, 
and the running time of the MAP calculation depends very strongly on the 
constraints and the number of distance matrices. 

However, what may be surprising is how much the running time depends on 
the quality of the data. This is vividly illustrated by the simulations, where in 
the case of two trees with two reticulation events and divergence of 0.1 mutation 
per site per tree, the sequences with 100 sites took on average 10.3 minutes to 
run, while the simulations with 6400 sites took on average 0.68 seconds each. 
This represents a difference of almost three orders of magnitude. On the same 
processor (Intel ® Xeon ® CPU at 2.33GHz) using real HIV data, an example 
with three 38-taxon distance matrices and pairwise constraints of three for each 
pair of distance matrices took 49 seconds, while an example with only two 40- 
taxon distance matrices with a single constraint of size three took almost 21 
minutes. The quality of the data impacts "how far" constNJ has to go down 
the list of coalescences in order to find one with the desired agreement profile, 
and how often it needs to calculate a new agreement partition. 

We have made some coding choices to increase the speed. For example, 
there is a natural partial order on agreement profiles, which is just the element- 
wise numerical order. In considering which coalescences to perform, we only 
investigate those coalescences which could lead to an agreement profile which is 
smaller than those which have already been performed. In principle, one could 
do a more comprehensive search which might lead to more optimal sets of trees; 
we have not found a significant improvement following such a direction. 

6 Simulations 

In order to evaluate the performance of constNJ, we performed a number of 
simulations. The trees in the study were generated as follows. We choose the 
number of trees in the recombination network, say k, the size of the trees, say 
n, and a number of rSPR moves, say m. We start with a tree Ti drawn from 
the Yule distribution of trees on n taxa. After choosing the desired expected 
number of substitutions on the tree (in simulations below, 0.1, 0.5, and 2), we 
divided this number by the number of non-root edges in the tree to get the 
expected number of substitutions per edge. We then drew the actual number of 
substitutions per edge from the exponential distribution with the corresponding 
mean to get the branch lengths of Ti . 

We then generated Tj + i from Tj by applying k rSPR moves to Ti as follows. 
For each rSPR move, first select a non-root edge uniformly; call the subtree 
below the chosen edge S. Cut off S then glue it back in on a uniformly selected 
edge of Ti not contained in S. The location along the chosen edge to attach S 
is chosen uniformly. Then to simulate differential rates of evolution of different 
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regions, take the average of the previous branch length and a branch length 
drawn from an exponential distribution as before. 

Given such a series of trees Ti, . . . , Tk, we generated a collection of distance 
matrices D\, . . . , Dk by simulating sequences on the trees. We did so using the 
Jukes-Cantor model of sequence evolution with a single rate. Distances were 
then calculated using the Jukes-Cantor distance correction (Felsenstein, 2004a). 
In case the Jukes-Cantor correction gave an undefined value, we repeated the 
analysis with a new sequence. We chose the simple Jukes-Cantor model to focus 
attention on our method rather than the distance estimator. 




Figure 5: constNJ simulation results for two trees, each on 30 taxa, averaged 
over 400 replicates. The first tree was drawn from the Yule distribution, and the 
second tree was made by applying a random rSPR move to the first. "constNJ" 
is our algorithm, "concatenated NJ" is neighbor-joining run with a concatenated 
alignment, and "independent NJ" is neighbor-joining run independently on the 
alignments for the different trees as described in the text. 

For the first set of simulations, we wanted to understand how the topolog- 
ical accuracy of constNJ compares to that of concatenating alignment blocks 
or running them independently. For concatenation, we estimated a single dis- 
tance for each pair of taxa by taking the Jukes-Cantor correction of the average 
number of substitutions in each alignment block; such a procedure simulates the 
process of concatenating equal- length alignment blocks. We then considered the 
resulting tree as the output of running NJ on the concatenated alignment for 
each alignment block. For independent construction, we simply ran NJ on each 
distance matrix independently. For constNJ, we constrained the rSPR distance 
between the trees to be less than or equal to the number of rSPR moves used 
to generate the trees. The trees used in the comparison were then the shortest 
(i.e. smallest total tree length) trees returned given those constraints. 

To measure topological accuracy, we used the Robinson-Foulds distance 
(Robinson and Foulds, 1981), which is simply one half the size of the symmetric 
difference of the edge split sets. The results are shown in Figures 5, 6, and 7. 
In these simulations, constNJ typically outperforms either alternate strategy. 
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Figure 6: constNJ simulation results for two trees, each on 30 taxa, averaged 
over 400 replicates. This time, two rSPR moves were applied to the first tree to 
get the second. 

When sequences are short, the main source of error is insufficiently accurate 
distance estimations; concatenation increases the amount of useful sequence 
information for distance estimation, and so outperforms independent construc- 
tion in that case. However, constNJ does almost as well. On the other hand, 
when sequences are long, independent estimation does well, as there is enough 
sequence information to reconstruct the tree for each block independently. In 
that case, constNJ also does well. 

The reader may object that these graphs represent an unfair comparison, 
as they assume that the number of reticulation events is correctly bounded in 
advance. The next two simulations address this objection. The first set, with 
results shown in Figure 8, seems to indicate that by looking at the . lengths 
file one can do a reasonable job of deciding how many rSPR moves to allow as 
was done in the example case of the introduction. The second set, with results 
shown in Figure 9, concerns what happens if one makes an incorrect decision. 

Figure 8 explores one of the main themes of this paper, which is the trade-off 
between phylogcnctic optimality (in this case total tree length) and congruence 
among individual trees. To make this figure, we generated pairs of trees as 
before, generating a Yule tree and then applying some number of rSPR moves 
to get the second tree, except that this time we threw out pairs of trees which 
did not have the correct rSPR distance between them (i.e. when a subtree was 
moved back to its original location). We drew branch lengths as above then 
simulated 1000 sites with an expectation of 0.5 mutations per site per tree. The 
x-axis is the index of the . lengths file, i.e. the number of rSPR moves between 
the two reconstructed trees. The left y-axis, "average total length" , shows the 
average length of trees with that number of rSPR moves between them. For 
instance, consider the point on the line labeled "three rSPR moves" which is at 
x- value 2. This says that if we simulate a pair of trees which are three rSPR 
moves apart as described above, then we expect the pair of trees output by 
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0.1 substitutions / site / tree 



0.5 substitutions / site / tree 



2 substitutions / site / tree 




Figure 7: constNJ simulation results for three trees, each on 30 taxa, averaged 
over 400 replicates. Here one random rSPR move was done to change the first 
tree to the second tree, and the second tree to the third. 



constNJ with agreement profile two to have total length about 1.038. Note 
that constNJ does not always return a tree for every agreement profile which 
is allowed under the constraints. In those cases, we simply took the total tree 
length from the largest non-empty agreement profile. 

Figure 8 shows exactly what one might expect. Namely, if we generate pairs 
of identical trees, then not much improvement in terms of total tree length is 
gained by allowing the trees to differ. However, if the trees are one rSPR move 
apart, then there is a substantial drop when allowing one rSPR move, but not 
much more after that; this indicates that only one rSPR move is called for by 
the data. The situation is similar for the other numbers of rSPR moves. Thus, 
at least in simulation with good quality data, it appears that one should be able 
to make a reasonable judgment as to the correct number of rSPR moves for the 
data set at hand, as was done in the introduction. 

We also performed some simulations allowing an incorrect number of rSPR 
moves (Figure 9). As shown there, giving a too-small constraint interpolates 
between results from concatenated data and the correct specification, while too- 
large constraints give performance similar to the correct constraint. One might 
expect constNJ with too-large constraints to give results similar to independent 
NJ; we do not have a clear explanation why this is not the case. 



7 Conclusion 

In this paper we present constNJ, a consistent distance-based algorithm for 
a collection of trees with pairwise rSPR constraints, such as those constraints 
satisfied by collections of trees which fit into a reticulation network. constNJ is 
deterministic and a strict generalization of the neighbor-joining algorithm. In 
order to ensure that the resulting set of trees satisfy the specified constraints 
on rSPR distance, we develop the theory of maximum agreement partitions, 
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Figure 8: Comparison of the total tree lengths for simulated trees differing by 
the described number of moves and then reconstructed using constNJ; average 
of 100 simulations. As can be seen, the most significant decreases in the total 
tree length happen when getting to the correct number of rSPR moves, after 
which the plot levels off. For example, on the line "two rSPR moves," there 
are significant decreases in length when going to one and two rSPR moves, but 
not much decrease after that. Thus, at least in simulation, it appears possible 
to make a reasonable choice concerning the number of rSPR moves to allow 
between the two trees. 
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Figure 9: Comparison of various specified constraints for constNJ; average of 
100 simulations. Data was simulated on two trees, each on 30 taxa, such that 
two random rSPR moves were done to change the first tree to the second tree. 
Then reconstruction was done with rSPR distance constraints of 0, 1, 2, 3, and 
4. As would be expected, having a constraint of (identical trees) has quali- 
tative performance similar to that of concatenated NJ, the best performance is 
obtained by the correct constraint of 2 moves, and a constraint of 1 gives results 
between those for and 2. The performance of 3 is similar to that for 2, while 
4's performance degrades with accurate distances. 



culminating in Theorem 2. 

We hope that this algorithm is the beginning of a new direction for phyloge- 
netic inference of reticulation networks. We simplify the problem considerably 
by assuming that the alignment blocks are known in advance; in doing so we 
preserve the correlation between sites in the alignment with the same history. 
Rather than first finding trees and then attempting to put them into a recombi- 
nation network, we investigate the balance between discordance between trees 
and optimality of the ensemble of trees. By using trees whose rooting is derived 
from outgroups, we find explicit evolutionary histories. 

To our knowledge, constNJ is the first algorithm of its kind, and there is 
considerable room for improvement over this first attempt. First, we enforce 
pairwise bounds on the rSPR distance between trees, which is a relatively weak 
way to show that these trees fit into a network. A more explicit approach would 
be desirable. Second, distance-based methods waste a considerable amount 
of information which is used by likelihood-based methods; a logical next step 
would be to create a likelihood-based method. Doing so would require a col- 
lection of "moves" analogous to rooted nearest-neighbor-interchange or rSPR 
moves for a heuristic search, but for collections of trees, with the constraint 
that the moves don't change the rSPR distance between trees too much. One 
option would be to explicitly store a reticulate network in memory and have 
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the trees moving about inside the reticulate network while the network changes. 
Such an algorithm would do a better job of actually reconstructing a reticulate 
network. Third, an alternative direction for heuristic optimization might be to 
do a more complete search for the minimum length tree in a manner analogous 
to algorithms searching for the BME tree. Fourth, a more immediate issue 
is that constNJ does not reconstruct branch lengths. It would be possible to 
do a distance-based branch length estimation in a manner similar to that for 
usual neighbor-joining, but the fact that we are choosing trees which may be 
sub-optimal according to the NJ criterion implies that negative branch lengths 
might be encountered. Alternatively, one might use a program such as PHYML 
(Guindon and Gascuel, 2003) to estimate branch lengths on each fixed topology 
independently. This appears to be a reasonable way to proceed, despite the 
fact that the correlation between branch lengths of the different trees is lost. A 
more correct approach will require some sort of correlation of the branch lengths 
in a model-based manner, and we believe that such reconstruction is probably 
best done in the context of a complete likelihood-based approach as described 
above. Finally, it might be interesting to incorporate some constNJ ideas into 
bootscanning-type methods for recombination breakpoint inference. 

Eight years ago, Kuhncr et al. (2000) wrote "[w]hcn recombination occurs 
adjacent sites may have different, although correlated, genealogical histories. 
Reconstructing these genealogies with certainty is impossible." Although we do 
not claim certainty for this (or any forthcoming) algorithm attempting to recon- 
struct reticulate phylogenetic history, we think that there is cause for optimism 
and look forward to seeing future developments in this area. 
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Abstract 

This paper introduces constNJ, the first algorithm for phylogenetic 
reconstruction of sets of trees with constrained pairwise rooted subtree- 
prune regraft (rSPR) distance. We are motivated by the problem of con- 
structing sets of trees which must fit into a recombination, hybridization, 
or similar network. Rather than first finding a set of trees which are op- 
timal according to a phylogenetic criterion (e.g. likelihood or parsimony) 
and then attempting to fit them into a network, constNJ estimates the 
trees while enforcing specified rSPR distance constraints. The primary 
input for constNJ is a collection of distance matrices derived from se- 
quence blocks which are assumed to have evolved in a tree-like manner, 
such as blocks of an alignment which do not contain any recombination 
breakpoints. The other input is a set of rSPR constraints for any set of 
pairs of trees. constNJ is consistent and a strict generalization of the 
neighbor-joining algorithm; it uses the new notion of "maximum agree- 
ment partitions" to assure that the resulting trees satisfy the given rSPR 
distance constraints. 
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1 Introduction 



Since the pioneering paper of ?, tens of thousands of papers have been published 
on the subject of "reticulate evolution." "Reticulate evolution" has generally 
come to mean evolution where genetic material for a new lineage may come from 
two or more sources, as in the case of recombination and hybridization. The 
Oxford English Dictionary (1989) defines "reticulated" to mean "constructed 
or arranged like a net; made or marked so as to resemble a net or network." 
Correspondingly rather than evolutionary history being representable as a tree, 
a network is more appropriate. A considerable amount of effort has gone into 
the phylogenetic reconstruction of these networks. 

Algorithms for phylogenetics in the presence of reticulation have followed 
a curiously different path then the mainstream of phylogenetics. As surveyed 
below, current algorithms fall into three types: first, there are algorithms which 
attempt to find the phylogenetic network displaying some fixed characteristics 
(such as splits in an alignment or some set of trees) which contain the mini- 
mum number of reticulation events. Secondly, there are algorithms to construct 
"splits networks," which do an excellent job of representing conflicting signals in 
the data, but do not give an explicit evolutionary history. The third approach 
is to sample from the posterior distribution of a population-genetics model, 
such as the coalescent with recombination. None of these approaches furnish a 
practical solution for certain cases, such as HIV researchers who would like to 
reconstruct the evolutionary history of an alignment which includes recombi- 
nant sequences. Indeed, first fixing a set of characteristics and then minimizing 
the number of reticulation events ignores the balance between number of retic- 
ulation events and phylogenetic optimality, the splits network approach does 
not tell a complete evolutionary story and population-genetic algorithms are 
are not yet sufficiently fast for DNA sequence datasets which have thousands 
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of nucleotides. As there are no algorithms which are practical for doing phylo- 
genetic reconstruction in this setting, HIV researchers who wish to reconstruct 
evolutionary history typically proceed in one of two ways: they either treat the 
whole alignment as having a single tree-like history which cannot possibly be 
correct, or they build trees on sub-alignments independently, which does not 
take into account the underlying network structure. These two extremes, of as- 
suming all trees have the same topology or allowing their topologies to differ in 
arbitrary ways, leave a substantial gap in the middle, where the correct balance 
of optimality and discord should be found. 

The goal of constNJ is to begin filling this gap in a manner analogous to 
classical phylogenetic inference algorithms. To do so, we make a different set of 
assumptions than has previously been done considering the input and desired 
output. Regarding the data, we assume that that the given alignment has been 
segmented into "alignment blocks", each of which can be described by a sin- 
gle tree. For example, in the case of recombination, the alignment blocks are 
the segments of an alignment which do not contain recombination breakpoints. 
(Note that for the purposes of this paper we will be using the word "recom- 
bination" in the general sense, including processes such as gene conversion.) 
Although the assumption that the data comes pre-segmented is a substantial 
one, we don't think that it is unreasonable. From a practical standpoint, some 
assumption needs to be made, as algorithms which attempt to find a correct 
segmentation of the data and a sequence of trees simultaneously have a difficult 
time searching the complete space. Furthermore, sometimes a segmentation is 
clear, such as the distinct RNA strands of the influenza genome. Other times, 
such as for recombination, it is not so clear, but even in this more difficult case 
the inference of recombination breakpoints has seen significant progress in the 
last 10 years (reviewed below). We will also assume that an outgroup has been 
selected. Such a choice is crucial, as it establishes directionality for reticulation 
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Figure 1: An example "reticulate" network and the two trees that it contains. 
Those two trees are related by a single rooted subtree prune regraft (rSPR) 
move, whereby the middle subtree is cut off of the tree and reattached at another 
location. The node v will be called a reticulation node. 

events. 

Regarding desired output, rather than actually building a single reticulate 
network, this paper will focus on building "correlated sets of trees," which dis- 
play the sorts of constraints found on trees which fit in a reticulate network. 
We are focused on building trees because each alignment block is correctly de- 
scribed by a single tree. However, this set of trees must fit into a network, 
which forces constraints on their topology. Specifically, the trees which sit in 
these networks must be related by rooted subtree-prune-regraft (rSPR) moves, 
whereby a rooted subtree is cut from the original tree and then re-attached in 
another location (Figure 1). We describe below how it is necessary for trees 
sitting in a reticulate network to be related by rSPR moves, though this is not 
a sufficient condition. 

For constNJ, we assume that the user can supply a series of constraints 
describing the number of rSPR moves allowable between pairs of alignment 
blocks. For example, if the alignment contains "pure" types and a single class 
of recombinants which are derived from a pair of types, then there should be 
two alignment blocks and the trees for those blocks should be related by one 
rSPR move as in Figure 1. The challenge, then, is to reconstruct a set of trees 
which satisfy the constraints and which together optimize some phylogenctically 
relevant criterion, such as likelihood, parsimony, or balanced minimum evolu- 
tion. Note that constNJ actually constructs a number of such sets of trees, in 
order to display the balance between optimality of the individual trees and the 
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number of reticulation events needed to fit the trees together into a network. 

We now present a motivating example. The CRF14_BG circulating recom- 
binant form (CRF) of HIV is known to be a mosaic of subtype B and subtype 
G viruses, and the breakpoints of the recombination events are known (?). We 
will call the region of the alignment where BG derives from the G subtype the 
"G region," and the region where BG derives from the B subtype the "B re- 
gion." As in ? and all similar papers we could find in the area, researchers build 
trees independently on the no-recombination blocks. We have repeated such an 
analysis in Figure 2, building PHYML maximum likelihood phylogenetic trees 
using the F84 model and rooted using CPZ.CD.90.ANT.U42720 (removed from 
tree for clarity). As one would hope, the trees do indeed show that the BG 
CRF derives one portion of its RNA from the G subtype, and the other part 
from the B subtype. However, there are many more differences between the 
two trees than should occur for an alignment with a single recombinant strain. 
For example, the rooting changes between the two trees, as does the location of 
the C and the F-K clades. Building a recombination network out of these trees 
would lead a number of spurious hypothesized recombinations. 

In contrast, for this dataset constNJ returns a collection of pairs of trees 
displaying the balance between the number of allowed rSPR moves between 
pairs of trees and phylogenetic optimality. This balance is described in the text 
output of constNJ, which is shown in Table 1 for the BG dataset. The first 
column shows the rSPR distance between the two reconstructed trees (in this 
case the G region tree and the B region tree). As described below, the notion 
of optimality for constNJ is total tree length, which is a trivial generalization 
of the balanced minimum evolution (BME) criterion (?). It is displayed in the 
second column for the pairs of trees. Thus the second line states that constNJ 
found a pair of trees which differ by a single rSPR move, and which have total 
tree length about 7.119. The third column just shows the difference between 
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Figure 2: Phylogenetic trees of the pure subtypes of HIV and the BG recombi- 
nant clade constructed independently using the no-recombination blocks of the 
HIV genome. The single letters (e.g. A,B,C. . . ) label clades of subtypes, and 
BG denotes a clade of circulating recombinant forms (CRFs) made from B and 
G subtypes. Tree (a) is built from the "G region," i.e. the region where the 
BG CRF derives from the G subtype, and tree (b) is built from the "B region," 
where BG derives from the B subtype. As noted in the text, although these 
trees do place the recombinant strains in the correct locations, they differ in a 
number of important ways which are not explained by recombination events. It 
is the perspective of this paper that these extra differences represent phyloge- 
netic error, and that accuracy can be improved by constraining the trees to fit 
into a recombination network. 

the second column values between rows. Thus 0.0942 signifies that there is a 
decrease of magnitude 0.0942 in total tree length by allowing a single rSPR 
difference between the two trees. 

In this way we can achieve an understanding of the balance between phylo- 
genetic optimality and number of recombination events. For example, we can 
see that the decrease in allowing a single recombination event is significantly 
greater in magnitude than that for allowing two rather than one. And sur- 
prisingly, allowing nine rSPR moves does not significantly decrease the total 
tree length compared to allowing three. Because the improvement in total tree 
length when allowing one rSPR move is significantly greater than that for any 
subsequent rSPR moves, we believe that Table 1 suggests that the data prob- 
ably arose from one recombination event, which agrees with the established 
knowledge concerning these taxa. 

Furthermore, the trees which constNJ finds assuming a single recombination 
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Figure 3: One pair of phylogcnetic trees constructed for the same dataset using 
constNJ. In contrast to Figure 2, the only difference between the two trees is 
the location of the BG recombinant clade. These two trees fit into a recombi- 
nation network with a single recombination event, as should be the case for a 
tree for the pure subtypes with a single recombinant strain like we have here. 
constNJ correctly identifies that the BG subtype is a recombinant of the B and 
G subtypes. 

event agree with the accepted recombination history of the BG recombinant 
circulating form (Figure 3). In particular, the only difference between them 
is the location of the BG clade, which switches from the G to the B subclade 
depending on the region analyzed. Importantly, these two trees can fit into a 
recombination network with a single reticulation node, in contrast to those in 
Figure 2. 

We believe that constNJ is the first algorithm of its kind, but will now review 
literature on related topics, starting with common terminology. The currently 
accepted term for the class of networks including both hybridization and recom- 
bination networks is "reticulate network". 1 If we consider a rooted tree to be 
a directed graph such that edges are directed away from the root, a reticulate 
network is a rooted phylogcnetic tree with additional directed edges making 
a directed acyclic graph with "tree nodes" of in-degree one and "reticulation 
nodes" of in-degree two (?). 

A considerable amount of work has gone into the problem of constructing a 
reticulate network given a set of phylogcnetic trees which it must contain. This 
problem was initiated by ? and considerable progress has been made by ?, ?, ?, 
and ??. As described above, we differ from these approaches as we would like 

x This terminology is redundant, as the word "reticulate" already means network-like. 
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to estimate the trees while ensuring that they fit into a reticulate network. 

A related problem (which was the original motivation for fitting trees into 
a network) is to reconcile distinct gene trees into a single species tree. This 
problem has received an appropriately large amount of attention, and has found 
a more realistic model-based formulation in ? and ?. These differ from the 
present paper because they assume that there is a single species tree, and that 
"correctness" of a gene tree should in part be judged by the degree to which 
it fits within a species tree due to a coalescent model. In our setting, however, 
there is no single species tree, and the coalescent model may not be appropriate. 

Sometimes a related assumption is made, which is not that complete species 
trees are known, but that the resulting recombination network must display 
a specified collection of bipartitions, which are typically called splits. This 
is equivalent to assuming a supplied alignment evolves according to the infinite 
sites model of mutation. The problem again is to find a network which minimizes 
the number recombination events. This problem was first formulated by ?, and 
was shown to be NP-hard in ?. Progress was made in a sub-case by ? and 
a simpler related (and in some ways more realistic) problem was solved by ?. 
In ?, the authors note that the algorithm in ? can be extended to this case. 
Although a different formulation, this splits/infinite sites approach represents a 
different version of the same strategy: find the network displaying a certain set 
of characteristics which minimizes the number of reticulation events. 

Splits network methods are a biologically useful and mathematically inter- 
esting way of understanding conflicting signals in phylogenetic data. The first 
method to construct splits networks from distance data was the split decompo- 
sition approach of ??. Another successful approach has been the "neighbor- net" 
algorithm created by ? and further analyzed by ?. These methods form a useful 
complement to phylogenetic analysis in the traditional tree-based sense, but do 
not reconstruct an explicit evolutionary history. We also note that recombina- 
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tion networks need not be circular split systems, which are the sorts of splits 
networks returned by neighbor- net. 

On the other end of the spectrum lie likelihood-based methods using the 
coalescent with recombination (?). Major recent advances have been made 
in this area. The full likelihood is quite daunting to compute, but ? have 
a parsimony-based approach which saves on computation by several orders of 
magnitude. Importance sampling (?) is also promising, but is not yet efficient 
enough for the long alignments typically encountered in phylogenetics. Also, 
it is the intent of this paper to construct a method which is independent of 
population genetics models such as the coalescent. 

A related though distinct line of research is the inference of recombination 
breakpoints. One of most basic and most commonly used methods for the infer- 
ence of recombination breakpoints is called "bootscanning" , whereby a window 
is scanned along the alignment and a phylogenetic tree is built for each position 
of the window; a change in topology between sections of the window can be inter- 
preted as evidence for a recombination breakpoint (?). There are many different 
variations on this theme. One promising line of research by Marc Suchard and 
collaborators apply multiple change-point models and reversible-jump MCMC 
to estimate trees and model parameters along the alignment (??). We also note 
that sometimes recombination breakpoints can be seen "with the naked eye" as 
in ?. In contrast to our paper, it is not the intent of these methods to accu- 
rately infer phylogeny; furthermore they do not posit any relationship between 
trees in neighboring no-recombination blocks. Furthermore, some of the more 
computationally intensive methods actually require a fixed reference tree. 

In summary, we are not aware of any available method for building reticulate 
networks which gives an explicit rooted phylogenetic history for each column 
of the alignment, which elucidates the balance of discord between the trees 
and optimality for those trees, and which is efficient enough to be useful for 
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modern data sets. The lack of practical phylogenetic algorithms in the presence 
of recombination was recently demonstrated in a simulation study by ?. ? have 
noted the lack of useful reconstruction algorithms for host-parasite relationships 
and have noted the need for an algorithm which balances tree concordance and 
optimality as constNJ does. Although far from a complete solution for these 
cases, we believe that constNJ is a first step in the right direction. 
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2 General description of constNJ 

The primary input for constNJ is a collection of alignment blocks, which as 
described are disjoint subsets of columns of the alignment which are assumed 
to evolve in a tree-like manner. In the case of alignments with recombinant 
sequences, the alignment blocks are simply the no-recombination blocks. Note 
that the alignment blocks need not be contiguous; for example a single recom- 
bination event with two recombination breakpoints will result in two, not three, 
alignment blocks. The other input for constNJ is a sequence of constraints 
on the rSPR distance between the trees constructed for the no-recombination 
blocks as described below. Given this input, the goal of constNJ is to exhibit 
the balance between discordance among the alignment-block trees on one hand, 
and optimality of the trees in some phylogenetic sense on the other. 

constNJ is a deterministic distance-based approach to reconstruction; we 
chose this direction for several reasons. First, the underlying space for a likeli- 
hood optimization scheme is even larger than usual, making a heuristic search 
even less appealing: there are [(2n — 3)!!] fc fc-tuples of rooted bifurcating phy- 
logenetic trees on n taxa. The sorts of constraints we will be imposing reduces 
this number substantially, but little is known about the resulting graph under 
the sorts of moves typically used in heuristic phylogenetic searches. Further- 
more, likelihood-based approaches are substantially improved by starting with a 
reasonable tree, which in modern applications is typically a distance-based tree. 
Thus, even if a likelihood-based approach was the eventual goal, a distance- 
based approach would be useful as a "seed" for the heuristic likelihood search. 
Finally, we feel that distance- and likelihood-based algorithms occupy distinct 
and complementary roles in the world of computational phylogenetics. 

Our goal is to design an approach which generalizes the remarkably accu- 
rate and hugely popular neighbor-joining algorithm (?). Remarkably, it took 
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almost 20 years for the phylogenetics community to learn the objective function 
of neighbor-joining; during that time it was even claimed that no such objec- 
tive function existed. However, it is now known that neighbor-joining greedily 
optimizes the "tree length" £(T, D) (defined below in Equation 1) for the given 
distance matrix D. constNJ generalizes this objective function, as it attempts 
to minimize the total length of all of all k trees (2) by a combination of greedy 
steps. 

The trees resulting from constNJ are constrained by the user to be some 
specified number of rooted subtree-prune-regraft (rSPR) moves from one to 
another. As displayed in Figure 1, reticulation events such as recombination and 
hybridization correspond to rSPR tree rearrangements. The converse is not true: 
arbitrary rSPR tree rearrangement events need not correspond to reticulation 
events. For recombination or hybridization to take place, the participants in 
the event need to exist at the same time; it is not hard to set up examples 
of rSPR move combinations which violate this fact (see, e.g., Song and Hein, 
2005) . Methods have been developed which take timing restrictions into account 
(??), but we do not incorporate these ideas into a phylogenctic reconstruction 
framework. This may be an interesting avenue to for future research, but on 
the other hand seeing such timing violations can actually be informative. First, 
there may be something wrong with the data. Second, it has been noted (?) 
that reticulation networks can appear to violate timing constraints if certain 
taxa are not sampled. The problem of determining the minimal number of 
"missing" taxa required to explain timing constraints has been analyzed by 
?. Therefore we have left interpretation of timing issues up to the user of the 
program. 

We now make a more formal statement of the problem constNJ attempts 
to solve; note that a similar formulation was made independently by ? in the 
context of host-parasite relationships. 
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Problem 1 (rSPR-constrained balanced minimum evolution). Given k n x n 
distance matrices D\, . . . , Dk and a symmetric kxk constraint matrix C, find the 
set of trees T\,...,Tk minimizing Y^i=i ^Pi> A) such that c? r sPR(7i, Tj) < dj 
for each i and j. 

Theorem 2. constNJ is a consistent algorithm to solve Problem 1. 

For constNJ we proceed in a manner analogous to that for neighbor-joining. 
The neighbor-joining algorithm starts with all taxa connected to a central node, 
then at every stage, chooses the "coalescence" (in other papers, "amalgamation) 
of trees which most decreases the value of the total tree length. We mimic this 
philosophy by evaluating coalescences based on how they affect the total tree 
length. However, in the end we must come up with a collection of trees T\ , . . . , Tfc 
that satisfy the prescribed rSPR constraints. This raises the question of how one 
might bound the rSPR distance of the eventual trees "ahead of time," i.e. before 
the termination of the coalescence steps. For instance, if in the developing trees 
one has the subtrees (a, b) for the first distance matrix, and (a, c) for the second 
distance matrix, it is clear that the resulting trees must have rSPR distance at 
least one between trees T\ and T2. 

The question of how to bound eventual rSPR distance is solved by Theo- 
rem 19. Specifically, we generalize m, the size of the maximum agreement forest 
(?) to these partially coalesced trees, which forms a sharp bound. In short, the 
m value for a pair of partially coalesced trees T and S is the minimum rSPR 
distance possible among trees resulting from coalescences of T and S; thus once 
a pair of partially coalesced trees achieves an m value above the corresponding 
constraint, we can throw that pair out, as the eventual resolved trees will never 
satisfy the constraints. 

Using this m we construct our greedy algorithm, as shown in Figure 4. Say 
that we only have two trees, and that we want to find the minimal-total-length 
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Figure 4: Schematic diagram of the constNJ algorithm. As described in the 
text, at every stage we attempt to find the optimal pair of partially coalesced 
trees which could eventually be at most some fixed number of rSPR moves 
apart. As shown in Theorem 19, the m value for a pair of partially coalesced 
trees forms a sharp lower bound for the eventual rSPR distance between those 
trees. Therefore pairs of partially coalesced trees which have m value exceeding 
the constraint on rSPR distance can be thrown out, as shown by the X. 

pair of trees which are only one rSPR move apart. At every stage, we attempt 
to find the best pair of partially coalesced trees with m values zero and one. 
We start with two star trees; m applied to this pair is zero. The first-step 
coalescence must also lead to a pair of trees which have m value zero, as one 
of the trees is still completely unresolved. Say the optimal, in terms of total 
tree length, second step NJ-type coalescence leads to a pair of trees which have 
m value one (indicated by the first diagonal arrow in Figure 4). Then we go 
down the list of second-step coalescences for the trees, and find the best one 
which does not increase the m value at all (indicated by a horizontal arrow in 
Figure 4). Next we repeat the process for each of the trees from the previous 
stage, saving the best pair of trees which have m values zero and one. In the 
end, we will have the best pair of trees which have rSPR distances zero and one 
which were achievable via a series of greedy steps. Although not guaranteed to 
be the optimal pair of trees, the algorithm is consistent. 
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3 Technical preliminaries 

In this section we review some definitions and clarify our notion of optimality. 
As stated in the introduction, we will always assume that an outgroup taxon has 
been chosen, and will label it p. Thus we always assume that p is contained 
in any taxon set A. We will use the following definitions. For the purpose of 
this paper, a tree on a finite taxon set A will be a rooted binary phylogenetic 
X-tree. A forest on taxon set X will be a collection of trees on disjoint taxon 
sets such that the union of the taxon sets is X. We will sometimes consider a 
tree on A to be a forest with a single tree. An unrooted tree on a finite taxon 
set X will be an unrooted phylogenetic X-tree (note that unrooted trees will be 
allowed to have multifurcating nodes.) C(R), £(R), and V(R) will denote the 
leaves, edges, and vertices of a tree, unrooted tree, or forest R. 

Although p represents the true rooting of the phylogenetic tree, we will not 
always assume that our trees or forests are rooted at p. We must do so 
because the NJ-type coalescences will not in general root the tree at the edge 
leading to p. Therefore, we must allow alternative roofings, but at the same 
time keep in mind that the rSPR distance between the trees must be calculated 
with respect to the edge leading to p. Thus we use the following definition of 
rSPR on an unrooted tree: given an unrooted tree U on a taxon set A 9 p, a 
single SPR move first cuts some edge of the tree except for that leading to p, 
resulting two rooted trees R and S. Say p G C(R). Suppress the degree two 
root node of R, and attach S to some edge of the resulting unrooted tree by 
inserting a degree two node onto the chosen edge, then connecting the root of 
S to that new node. This definition is the same as that of ? when considering 
trees rooted at the edge leading to p. 

As with any distance function defined implicitly in terms of a graph, the 
minimum number of rSPR moves required to transform one tree T into another 
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S is a metric; we define (I t spr(T, S) to be this number. 

3.1 Tree length and the balanced minimum evolution cri- 
terion 

As reviewed by ?, phylogenetics researchers now understand the optimality func- 
tion of the neighbor-joining algorithm (?). Let p(i, j) denote the path from i to 
j in the unrooted tree T, and define the weight of a path from leaf i to leaf j as 

Wihj)= ^ delR-T 

Then the "length" of an n taxon tree T with respect to an n x n distance matrix 
D is (?): 

£(T,D) = Y / w(i,j)D iJ . (1) 

The name "tree length" comes from the fact that if D is a distance matrix 
derived from some assignment of branch lengths to the edges of T, then I will 
be the total length of all of the edges. However, the name may be somewhat 
confusing initially, because i need not be defined as sum of the branch lengths 
of any specific tree. 

The tree T which minimizes £(T, D) for some distance matrix D is known 
as the balanced minimum evolution (BME) tree for the distance matrix D. 
The BME criterion is consistent (?), and neighbor-joining is a consistent tree- 
building heuristic which greedily minimizes total tree length (?) As described 
in Problem 1, constNJ attempts to minimize 

k 
i=l 

while enforcing pairwise constraints on the rSPR distance between pairs of trees. 
When k = 1 constNJ is simply neighbor joining, while for k > 1 constNJ is a 
strict generalization of NJ. 
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4 Rooted SPR and maximum agreement parti- 
tions 



This section describes the primary technical content of this paper. As described 
in the introduction, we would like to proceed via coalescences in a manner simi- 
lar to neighbor-joining, while ensuring that the eventual rSPR distance between 
the trees is not too large. In order to assure adherence to the rSPR criterion, 
we develop the notion of maximum-agreement partition, which generalizes the 
notion of maximum agreement forest from ?. As shown in Theorem 19, max- 
imum agreement partitions and the associated m value allow us to bound the 
rSPR distance between the two partially resolved trees "in advance." 

4.1 Compatibility and coalescence 

We will use the following definitions. A split on a taxon set A is a bipartition of 
X. Because the set X will be clear, we will often abuse notation by identifying 
A C X with the partition A\(X \ A). Furthermore, because we have a special 
element p, we can distinguish between the two sides of a split; the side not 
containing p will be called the rsplit (short for rooted split) of the split. It is 
clearly equivalent to describe a given partition in terms of a split or an rsplit, 
and we will use the two descriptions interchangeably. 

Note that the neighbor-joining algorithm is typically thought of as proceed- 
ing by coalescing internal nodes of an unresolved phylogenetic tree (see Figure 4); 
however for our purposes it will sometimes be easier to consider the forest ob- 
tained by deleting the central node and the associated edges. The opposite 
construction will be called "starification." 

Definition 3. Given a forest F, define the starification ~k(F) of F as the 

following unrooted tree. If F has one tree, then suppress the degree two root 
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node of F. If F has two trees, then join their root nodes by an edge. If F has 
three or more trees, join all of the root nodes of trees of F to a single node. The 
new introduced node will be called the star node. 

We will identify any one, two, or three tree forest F with its starification, in 
which case there is no designated star node. 

Definition 4. Given a tree T which is part of a forest F on a taxon set X, 
define the edge splits £.e(T) to be C(T)\[C(T)] C along with the set of splits on 
X induced by the edges ofT. We define T,e(F) to be the union of the edge splits 
ofT across all trees T in F. 

For example, the rsplits {3} and {2,3,4} are both edge rsplits of the forest 
((l,p),2);(3,4). 

Definition 5. Given a forest F on a taxon set X , A is a separating split of F 
if A is the union of taxon sets for a collection of at least two trees of F. The 
set of separating splits of F will be denoted Es(F). 

Given a forest F we will write E(F) for V E (F) U S S (F). This will be the set of 
splits used to make agreement partitions as described below. 

Definition 6. Two rsplits A and B will be called compatible if either APiB = 0, 
AC B, or B CA. 

Because A and B are the sides of the splits which do not contain p, this is 
the same as the usual criterion for split compatibility (?). Therefore we have 
the following well known theorem. 

Theorem 7 (Buneman, 1971). A collection of splits M on a taxon set X is 
pairwise compatible iff there exists an unrooted tree T on taxa X such that M 
is a subset o/Ss(T). There is a one-to-one correspondence between compatible 
sets of splits on X and minimally-resolved unrooted trees on X. 
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Definition 8. Two forests F and G on taxon set X are compatible 
and Y,e{G) are pairwise compatible. 

Definition 9. The join T A S of two trees T and S on disjoint taxon sets is 
the tree obtained by joining the root nodes of T and S to a new root node. The 
coalescence ofT and S in the forest F is the forest {T A S} U (F \ {T, S})}. 

Note that the operation of coalescence gives a partial order on the set of 
forests on a given taxon set. Namely, we write F >; G if F is a coalescence of 
G. Clearly, trees are the maximal elements in this partial order. 

Definition 10. A tree S is a subtree of an unrooted tree U if S is one compo- 
nent of the disconnected graph obtained by cutting an edge of U. A tree S is a 
subtree of a rooted tree T if S is a component of the disconnected graph obtained 
by cutting an edge of T, and S does not include the root of T. 

We emphasize that the subtree definition is different than than that of an 
induced subtree, which is as follows. The existence of induced subtrees is guar- 
anteed by Theorem 7 or its rooted equivalent. 

Definition 11. Given a tree T and Y C C(T), T\y is the (rooted or unrooted) 
tree on taxa Y with rsplits {A D Y : A e S S (T)}. 

There is also an analogous definition for forests. 

Definition 12. Given a forest F and Y C C(F), F\ Y is 

{T\ Y :T eF and L(T) (17^}. 

Proposition 13. // two forests F and G on a taxon set X are compatible, and 
F has more than one tree, then there exists H >~ F such that H is compatible 
with G. 

Proof. If F has two or three trees, the proposition is trivial. Otherwise, let U 
be the tree with split set equal to the union of T<e(F) and T,e(G). If U is not 

20 



resolved (i.e. if there exists an internal node of degree greater than three) then 
take an arbitrary resolution. As all of the trees T of F are resolved, each T sits 
as a subtree of U; let J be the union of the nodes of the T £ F (considered as 
nodes of U). Let p denote the longest path in U which does not contact any of 
the nodes in J. Because F has at least four trees, p will be nontrivial. Pick one 
end of this path, which must be connected to a pair of trees 5", 5"' of F. Let 
K = C(S') U C(S"). As the split K\K C is already a split of U, we know that it 
is compatible with T,e(F) and £#(£?), and thus that S(S" A S") is compatible 
with T, E (G). Let H be the coalescence of S' and S" in F. □ 

4.2 Maximum agreement partitions 

In this section we introduce the notion of maximum agreement partition (MAP), 
which generalizes the idea of maximum agreement forests. Maximum agreement 
forests were first introduced by ?, and further refined by ?. In broad terms, 
given two forests F and G on a taxon set X, we will be interested in considering 
partitions P which are obtainable from F and G independently by "combining" 
edge splits and separating splits of those forests, in the same way that edge 
cuts are combined when making maximum agreement forests. The appropriate 
notion of "combining" splits is the infimum, which we now describe. 

The set of partitions on a given finite set Y form a partial order, such that 
a partition Pi < P 2 if Pi is a refinement of P 2 . In fact, the set of partitions 
is a complete lattice, meaning that any set of partitions on Y has a supremum 
and an infimum. For a collection of partitions M, we will use inf(M) to denote 
their infimum. 

Thus, as described below, a necessary condition for P to be an agreement 
partition for two forests F and G is that P can be expressed as inf(M) and 
inf(A) for M C E(F) and N C E(G). It will now be useful to connect that 
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definition to one in terms of convexity of characters (?). 

Definition 14. Given a partition P on some set K , define P\j for some J C K 
to be the partition [Y n J : Y e P}. 

The following is a slight generalization of the definition of convexity given 
by?. 

Definition 15. A partition P on a taxon set X is convex on a forest F on X 
if there exists an H > F such that P induces a convex character on if(H), i.e. 
if there exists a partition P on vertices V(if(H)) such that 

(i) P = P\x- 

(ii) Any Y E P separates *k(H) into connected components. 

The following proposition relates the notions of "obtainable by a series of 
cuts along edge or separating splits" with the notion of character convexity. 

Proposition 16. A partition P of a taxon set X is convex on a forest F iff 
there exists M C E(F) such that P = inf(M). 

Proof. Assume M C E(F) such that P = inf(M). Note that K = inf(M n 
Es(F) is a set of disjoint separating or root-edge splits for F; thus we can 
perform coalescences, making H, such that the splits from sets in K are edge 
splits of H . Such an H will satisfy the criteria of the definition. 

For the converse implication, cutting any edge (u, v) of *k(H) for any H y F 
gives a split in E(F). We then define M as the set of such splits s U; „ such that 
such that (u, v) is an edge and u and v are in distinct sets of the partition P. 
By construction, P = inf(M). □ 

The following definition generalizes the notion of agreement forest. 
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Definition 17. We say that a partition P of taxon set X is an agreement 
partition for a pair of forests F,G on X if 

(i) for every pair of rsplits A 6 ^e(F), B e and Y 6 P, AC\Y is 

compatible with B PiY. 

(ii) P is convex on F and G. 

We say that P is a maximum agreement partition (MAP) if the number of sets 
of P is less than or equal to that of any other agreement partition. Let m(F, G) 
be the number of sets in the MAP minus one. 

Note that by Theorem 7, for two resolved unrooted trees U, V on a taxon 
set X 9 p, the size of the maximum agreement partition is the same as the size 
of the maximum agreement forest of the trees (rooted at p) in the sense of ?. 
Recall that the definitions of maximum agreement forest in ? differs from that 
of ? and ?. 

Proposition 18. Assume F , G, and H are forests on a taxon set X such that 
H >z F, and P is an agreement partition for H and G. Then P is also an 
agreement partition for F and G. 

Proof. Part (i) of the definition is clear as T,e(F) C Y,e{H). Next we check 
(ii), i.e. that P is convex on F. Note that H >z F implies T,(H) C £(F), as 
the "extra" edge splits of H will be separating splits of F. By Proposition 16, 
there exists an M C E(if) such that P = inf(M); by the previous sentence 
M C £(.F) and so by Proposition 16 again P is convex on F. □ 

The following theorem is the main motivation for studying the maximum 
agreement partition. Thus the proposition says that the size of the maximum 
agreement partition of the two forests F and G is the same as the rSPR distance 
in the best case. 
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Theorem 19. The minimum of d r spR.(?7, V) across all unrooted trees U > F 
and V >: G is equal to m(F, G). 

The proof of this proposition will come after two lemmas. 

Lemma 20. Given a partition P convex on F and Y G P such that F\ Y 
includes two distinct trees T and S, then there exist distinct trees T,S G F such 
that (fAS)' 

R G {f, S}, we have either Z C C{R) or Z n C(R) 



= T A S. Furthermore, for any Z G P not equal to Y and any 



Proof. Let H and P be as in Definition 15. Let Y G P be such that Y <1C{F) = 
Y. Let f (resp. S e F) be the tree such that f\ Y = T (resp. S\ Y = S). Let 
Q = £(T)U£(S). 

We now show that T ^ S. The contrary would imply if(H)\Q = T\q. 
Because Q C Y and if(H)\ Y is connected by definition, T\q is connected so T 
and S would not be distinct. This is a contradiction. Thus C(T) C(S) = so 
(fAS) = f\ Y A S\ y = T A S. 

We now show the second statement of the lemma. Let r{W) denote the 
root node of any tree W G F. Note that r(T) and r(S) must be in Y because 
ir(H)\Q is connected and the path between any a G C(T) and b G C(S) passes 
through r(f) and r(S). 

Now assume that for some R G {T, S} we have that some Z ^ Y of P 
intersects C(R) but is not contained in it. Take c G ZC\C{R) and d G ZCi[C(R)] c . 
Let Z G P be such that Zf)L(F) = Z. By the same argument as in the previous 
paragraph, r(R) is in Z. This is a contradiction as Y and Z are disjoint. □ 

Lemma 21. Assume that F and G are forests on a taxon set X , and P is an 
agreement partition for F and G. Then there exist resolved trees U > F and 
V y G such that P is an agreement partition for U and V. 

Proof It is enough to show that if one of the forests, say F, has at least four 
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trees then there exists an H >- F such that P is an agreement partition for H 
and G. 

If for every Y <E P we have that F|y is a single tree, then we can make H by 
taking an arbitrary coalescence of F; any such coalescence will be "broken" by 
P and thus will not introduce any splits violating (i) of Definition 17. Thus we 
assume that F\y has at least two trees. By Proposition 13, there exist nontrivial 
T,S £ F\y such that the coalescence of T and S in F\y is compatible with G\y- 



T A S 



= TAS. Let H 

Y 



By Lemma 20, there exist T and S in F such that 
be the coalescence of T and S in F; the second statement of Lemma 20 implies 
that the coalescence of T and S does not introduce any new edge splits when 
restricted any Z ^ Y in P, and so H satisfies the criterion (i) of a maximum 
agreement partition. 

Also, P is convex on Ho, establishing criterion (ii). Indeed, by Proposi- 
tion 16, let M C £(.F) be such that P = inf(M); we need to show that 
M C E(flo)- The only difference between T,(F) and E(#o) is that E(flo) 
does not have separating partitions which separate T and S, but M cannot 
contain such a partition because T and S both have taxa in the same partition 
of P. □ 

Proof of Proposition 19. Lemma 21 shows that the minimum of m(U, V) is less 
than or equal to m(F, G). The other inequality follows from Proposition 18. 

Now note that for a resolved tree on X rooted at p, the notions of maximum 
agreement forest and maximum agreement partition coincide. Thus by Theo- 
rem 2.1 of ?, m(U, V) is equal to the rSPR distance between U and V for any 
resolved U >- F and V >- G. □ 
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4.3 Calculating the maximum agreement partition 

As introduced above, and described more clearly below, constNJ needs to find a 
great number of agreement partitions. Indeed, a sample constNJ run with three 
distance matrices, 27 taxa, with pairwise constraints of size two required 5867 
calls to the subroutine finding the size of a MAP. Therefore a speedy calculation 
of the MAP is essential. 

In the present implementation of constNJ, the MAP is calculated is via a 
simple extension of the algorithm by ?. As with the usual Bordewich-Semple al- 
gorithm, we contract isomorphic subtrees and replace chains of pendant subtrees 
with chains of three pendant edges. However, we consider separating rsplits as 
well as edge rsplits to find the agreement partition. 

An alternative would be to consider an integer linear programming (ILP) ap- 
proach to the MAP problem based the work of Yufeng Wu, who has recently de- 
veloped an ILP approach to finding a maximum agreement forest (?). Although 
Wu's ILP approach is many orders of magnitude faster than the Bordewich- 
Semple algorithm for finding the size of the maximum agreement forest in the 
"hard" case when two trees are quite different, our tests have shown that it is 
slower in the "easy" case. This difference is probably because there is overhead 
to creating the linear programming matrix, which does not scale strongly with 
respect to the difficulty of the problem, while the Bordewich-Semple algorithm 
is very fast for easy problems. It is possible that some of the ILP overhead could 
be amortized by clever re-use of portions of the matrix across coalescences, or 
a combination of Bordewich-Semple and Wu ideas, but we have not followed 
these directions. 
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5 The constNJ algorithm 



Assume constNJ is given k distance matrices on a taxon set X. On the way 
to constructing our trees T\ , . . . , on X we will be constructing collections 
of forests F = F\, . . . , i^; we will call such a collection F an "instance." For 
example, each boxed pair of trees in Figure 4 is an instance (after deleting the 
central "star" nodes). The agreement profile for an instance F is the k x k 
matrix a(F) where a(F)ij is m(Fi,Fj). It describes the degree to which the 
forests agree. The identical agreement profile is the k x k zero matrix. Define 
the instance tensor to be a partially filled tensor of instances indexed by N fc2 , 
where F is stored in the "slot" indexed by its agreement profile a(F). 

Algorithm 22 (constNJ). Given nxn distance matrices D\,...,Dk and a 

k x k constraint matrix C, 

1. Let F^ ) be the trivial instance, i.e. is the trivial forest on n taxa for 
each 1 < i < k. Let H^ ) be the instance tensor containing only ¥^ . 

2. Repeat the following until termination: 

a. Let H be the instance tensor from the previous step. 

b. Rank all possible coalescences of all of the instances of H by how 
much they will decrease total tree length. 

c. Make a "step " by walking down this ranked list in order as follows: 

i. Perform the chosen coalescence, say of an instance F, and as- 
sume that the resulting instance F' has agreement profile X. 

ii. Lf some entry of X is greater than the corresponding element of 
C, discard F' and test the next coalescence. 

Hi. If not, and F' is the first in this step to have agreement profile X , 
then save it. If, on the other hand, another instance has already 
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been found in this step with agreement profile X, then discard F' 
as it must have a larger total tree length, 
iv. Stop walking down the list if X is the identical agreement profile, 
d. Terminate if each of the Fi have three trees or fewer. 

We now show that this algorithm is consistent. 

Proof of Theorem 2. In broad terms, Algorithm 22 is consistent because of the 
consistency of neighbor-joining (??) and because the coalescence which most 
decreases the total tree length must be a neighbor-joining step (?). We are given 
a sequence of distance matrices D\ , . . . , Dk and a symmetric k x k constraint 
matrix C. By hypothesis, these distance matrices come from a sequence of 
trees Ti , . . . , Tk such that the rSPR distance between Ti and Tj is bounded 
above by C\j. First, by the consistency of neighbor-joining, NJ applied to each 
distance matrix independently will recover the correct collection of trees. Say 
the sequence of neighbor-joining coalescences making Ti gives a series of forests 
F\ t i, . . . , Fn-ij, where F„_i^ = Ti. Thus by Theorem 19 (more specifically, 
Proposition 18) and our assumptions about the T,, 

m(F a ,i,F btj )<Cij (3) 

for any 1 < a, b < k and 1 < i,j < n. Thus the constraints will always be 
satisfied as long as we follow the sequence of NJ steps for each tree. 

Next we show by induction that given this data, at every step every constNJ 
forest will one of the F r j for 1 < r < n and 1 < j <n — l. This is clearly true 
at initialization. By induction, assume the assertion is true at some constNJ 
step. Consider the coalescence which decreases total tree length as much as 
possible irrespective of constraints; say it occurs in Fk.i- As the coalescence 
decreases the tree length of Fk^i compared to other coalescences of Fk^i, is also 
a neighbor-joining step for Fks, making Fk+iA- By the previous paragraph, 
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we know that this coalescence will preserve the constraints, and thus is also a 
constNJ step (recall that each constNJ step decreases the total tree length as 
much as possible amongst coalescences which preserve the constraints). Thus 
at the end we get i^-i.i for each i by induction. Because F n -\^ = T i} constNJ 
is a consistent algorithm. 

□ 

5.1 Implementation 

We have implemented constNJ in the fast functional/imperative language ocaml 
(?). The implementation has a simple command line interface, which is doc- 
umented in the accompanying manual. It is available for download from the 
author's website, at http://www.stat.berkeley.edu/-matsen/constNJ/ . 

As described above, the primary input for constNJ is a series of distance 
matrices, with one for each alignment block. The program is designed to ac- 
cept distance matrices from the DNADIST program of the PHYLIP package, 
although longer lines and taxon names are allowed. The first taxon is assumed 
to be the outgroup. The program assumes that the taxa in the distance matrices 
are ordered in a corresponding way. For instance, if one is using constNJ to 
investigate recombination, all of the taxa should be listed in the same order, so 
that the taxa in the no-recombination blocks correspond to one another. On the 
other hand, if one is using constNJ to investigate host-parasite relationships, 
the ith taxon in the parasite alignment should parasitize the ith taxon in the 
host alignment. If, for example, a given parasite is present in multiple hosts, 
this will require duplication of that parasite sequence in the alignment. 

The second input for constNJ is a set of constraints for the resulting corre- 
lated set of trees. There are two options for specifying these constraints: first, 
via a file, or second, by enforcing "linear" constraints. For example, assume we 
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supply three distance matrices: D , D\, and D 2 , and would like to construct 
trees T , T\, and T 2 . To specify constraints for these matrices, one writes one 
constraint per line, with first the indices of the distance matrices then the num- 
ber of rSPR moves allowed between those distance matrices. For example, a 
line saying 2 1 would mean that T and T 2 are constrained to be one rSPR 
move apart. On the other hand, one may specify a linear constraint with a 
linear constraint parameter. If the linear constraint parameter is L, then trees 
Tj and Tj are constrained to be L ■ \i — j\ rSPR moves apart. So if we apply a 
linear constraint with parameter 2 in our example, then both T and Ti and T\ 
and T 2 are constrained to be at most 2 rSPR moves apart, while T and T 2 are 
constrained to be at most 4 rSPR moves apart. 

The output for constNJ is collection of correlated sets of trees, each of which 
get their own . tre file, along with a . lengths file, which describes the total tree 
length for each of these sets of trees. constNJ returns at most one correlated set 
of trees for each agreement profile within the constraints, which is labeled by 
the agreement profile. If the constraints are given in a file, then the agreement 
profile is written in the order given in the file. If linear constraints are given, 
the agreement profile is written as a vector representing an upper triangular 
matrix in the usual way. For example, the agreement profile for three trees with 
linear constraints is written (c? rS pR(T , Ti), d rSPR (T , T 2 ), d r spR(7i, T 2 )), so the 
set of trees in the file example . 2_1_1 .tre has agreement profile (2,1,1). The 
. lengths file contains the information on tree lengths, as in Table 1 of the 
introduction. Namely, for each correlated set of trees returned by constNJ, it 
displays the total tree length for those trees. 
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5.2 Speed 

A rigorous worst-case runtime analysis of constNJ would show that it can be 
incredibly slow. Indeed, the maximum agreement partition is a generalization of 
the maximum agreement forest; thus finding the size of the MAP is NP-hard by 
the corresponding theorem by ?. However, constNJ docs not just need to solve 
one such problem, it needs to solve quite a number of them. At worst, constNJ 
would need to find as many MAP's as there are possible coalescences, just for a 
single step and a single instance; if an instance had forests with l\,...,lk trees, 
then there will be ( j 1 ) x • • • x (^|) possible coalescences, each of which in theory 
could require solving of a MAP problem. At any step there can be as many 
instances as there are agreement profiles satisfying the constraint matrices, and 
a problem with n taxa and k distance matrices will require nk such steps. Such 
an analysis would not give a very clear understanding of the practical time 
requirements of running constNJ. 

In practice, constNJ can be used effectively for a moderate number of taxa 
and a small number of closely-constrained trees. The running time depends 
somewhat on the number of taxa, but quite a lot on the constraints and num- 
ber of distance matrices. Indeed, the main bottleneck is the MAP calculation, 
and the running time of the MAP calculation depends very strongly on the 
constraints and the number of distance matrices. 

However, what may be surprising is how much the running time depends on 
the quality of the data. This is vividly illustrated by the simulations, where in 
the case of two trees with two reticulation events and divergence of 0.1 mutation 
per site per tree, the sequences with 100 sites took on average 10.3 minutes to 
run, while the simulations with 6400 sites took on average 0.68 seconds each. 
This represents a difference of almost three orders of magnitude. On the same 
processor (Intel ® Xeon ® CPU at 2.33GHz) using real HIV data, an example 



31 



with three 38-taxon distance matrices and pairwise constraints of three for each 
pair of distance matrices took 49 seconds, while an example with only two 40- 
taxon distance matrices with a single constraint of size three took almost 21 
minutes. The quality of the data impacts "how far" constNJ has to go down 
the list of coalescences in order to find one with the desired agreement profile, 
and how often it needs to calculate a new agreement partition. 

We have made some coding choices to increase the speed. For example, 
there is a natural partial order on agreement profiles, which is just the element- 
wise numerical order. In considering which coalescences to perform, we only 
investigate those coalescences which could lead to an agreement profile which is 
smaller than those which have already been performed. In principle, one could 
do a more comprehensive search which might lead to more optimal sets of trees; 
we have not found a significant improvement following such a direction. 
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6 Simulations 



In order to evaluate the performance of constNJ, wc performed a number of 
simulations. The trees in the study were generated as follows. We choose the 
number of trees in the recombination network, say k, the size of the trees, say 
n, and a number of rSPR moves, say m. We start with a tree T\ drawn from 
the Yule distribution of trees on n taxa. After choosing the desired expected 
number of substitutions on the tree (in simulations below, 0.1, 0.5, and 2), we 
divided this number by the number of non-root edges in the tree to get the 
expected number of substitutions per edge. We then drew the actual number of 
substitutions per edge from the exponential distribution with the corresponding 
mean to get the branch lengths of T\ . 

We then generated Ti+i from Tj by applying k rSPR moves to Tj as follows. 
For each rSPR move, first select a non-root edge uniformly; call the subtree 
below the chosen edge S. Cut off S then glue it back in on a uniformly selected 
edge of Ti not contained in S. The location along the chosen edge to attach £ 
is chosen uniformly. Then to simulate differential rates of evolution of different 
regions, take the average of the previous branch length and a branch length 
drawn from an exponential distribution as before. 

Given such a series of trees Ti, . . . , Tk, we generated a collection of distance 
matrices D\,. . . ,Dk by simulating sequences on the trees. We did so using the 
Jukes-Cantor model of sequence evolution with a single rate. Distances were 
then calculated using the Jukes-Cantor distance correction (?). In case the 
Jukes-Cantor correction gave an undefined value, we repeated the analysis with 
a new sequence. We chose the simple Jukes-Cantor model to focus attention on 
our method rather than the distance estimator. 

For the first set of simulations, we wanted to understand how the topolog- 
ical accuracy of constNJ compares to that of concatenating alignment blocks 
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Figure 5: constNJ simulation results for two trees, each on 30 taxa, averaged 
over 400 replicates. The first tree was drawn from the Yule distribution, and the 
second tree was made by applying a random rSPR move to the first. "constNJ" 
is our algorithm, "concatenated NJ" is neighbor-joining run with a concatenated 
alignment, and "independent NJ" is neighbor-joining run independently on the 
alignments for the different trees as described in the text. 

Figure 6: constNJ simulation results for two trees, each on 30 taxa, averaged 
over 400 replicates. This time, two rSPR moves were applied to the first tree to 
get the second. 

or running them independently. For concatenation, we estimated a single dis- 
tance for each pair of taxa by taking the Jukes-Cantor correction of the average 
number of substitutions in each alignment block; such a procedure simulates the 
process of concatenating equal- length alignment blocks. We then considered the 
resulting tree as the output of running NJ on the concatenated alignment for 
each alignment block. For independent construction, we simply ran NJ on each 
distance matrix independently For constNJ, we constrained the rSPR distance 
between the trees to be less than or equal to the number of rSPR moves used 
to generate the trees. The trees used in the comparison were then the shortest 
(i.e. smallest total tree length) trees returned given those constraints. 

To measure topological accuracy, we used the Robinson- Foulds distance (?), 
which is simply one half the size of the symmetric difference of the edge split 
sets. The results are shown in Figures 5, 6, and 7. In these simulations, constNJ 
typically outperforms either alternate strategy. When sequences are short, the 
main source of error is insufficiently accurate distance estimations; concatena- 
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Figure 7: constNJ simulation results for three trees, each on 30 taxa, averaged 
over 400 replicates. Here one random rSPR move was done to change the first 
tree to the second tree, and the second tree to the third. 

tion increases the amount of useful sequence information for distance estimation, 
and so outperforms independent construction in that case. However, constNJ 
does almost as well. On the other hand, when sequences are long, independent 
estimation does well, as there is enough sequence information to reconstruct the 
tree for each block independently. In that case, constNJ also does well. 

The reader may object that these graphs represent an unfair comparison, 
as they assume that the number of reticulation events is correctly bounded in 
advance. The next two simulations address this objection. The first set, with 
results shown in Figure 8, seems to indicate that by looking at the .lengths 
file one can do a reasonable job of deciding how many rSPR moves to allow as 
was done in the example case of the introduction. The second set, with results 
shown in Figure 9, concerns what happens if one makes an incorrect decision. 

Figure 8 explores one of the main themes of this paper, which is the trade-off 
between phylogenetic optimality (in this case total tree length) and congruence 
among individual trees. To make this figure, we generated pairs of trees as 
before, generating a Yule tree and then applying some number of rSPR moves 
to get the second tree, except that this time we threw out pairs of trees which 
did not have the correct rSPR distance between them (i.e. when a subtree was 
moved back to its original location). We drew branch lengths as above then 
simulated 1000 sites with an expectation of 0.5 mutations per site per tree. The 
x-axis is the index of the . lengths file, i.e. the number of rSPR moves between 
the two reconstructed trees. The left y-axis, "average total length" , shows the 
average length of trees with that number of rSPR moves between them. For 
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Figure 8: Comparison of the total tree lengths for simulated trees differing by 
the described number of moves and then reconstructed using constNJ; average 
of 100 simulations. As can be seen, the most significant decreases in the total 
tree length happen when getting to the correct number of rSPR moves, after 
which the plot levels off. For example, on the line "two rSPR moves," there 
are significant decreases in length when going to one and two rSPR moves, but 
not much decrease after that. Thus, at least in simulation, it appears possible 
to make a reasonable choice concerning the number of rSPR moves to allow 
between the two trees. 

instance, consider the point on the line labeled "three rSPR moves" which is at 
x- value 2. This says that if we simulate a pair of trees which are three rSPR 
moves apart as described above, then we expect the pair of trees output by 
constNJ with agreement profile two to have total length about 1.038. Note 
that constNJ does not always return a tree for every agreement profile which 
is allowed under the constraints. In those cases, we simply took the total tree 
length from the largest non-empty agreement profile. 

Figure 8 shows exactly what one might expect. Namely, if we generate pairs 
of identical trees, then not much improvement in terms of total tree length is 
gained by allowing the trees to differ. However, if the trees are one rSPR move 
apart, then there is a substantial drop when allowing one rSPR move, but not 
much more after that; this indicates that only one rSPR move is called for by 
the data. The situation is similar for the other numbers of rSPR moves. Thus, 
at least in simulation with good quality data, it appears that one should be able 
to make a reasonable judgment as to the correct number of rSPR moves for the 
data set at hand, as was done in the introduction. 

We also performed some simulations allowing an incorrect number of rSPR 
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moves (Figure 9). As shown there, giving a too-small constraint interpolates 
between results from concatenated data and the correct specification, while too- 
large constraints give performance similar to the correct constraint. One might 
expect constNJ with too-large constraints to give results similar to independent 
NJ; we do not have a clear explanation why this is not the case. 

Figure 9: Comparison of various specified constraints for constNJ; average of 
100 simulations. Data was simulated on two trees, each on 30 taxa, such that 
two random rSPR moves were done to change the first tree to the second tree. 
Then reconstruction was done with rSPR distance constraints of 0, 1, 2, 3, and 
4. As would be expected, having a constraint of (identical trees) has quali- 
tative performance similar to that of concatenated NJ, the best performance is 
obtained by the correct constraint of 2 moves, and a constraint of 1 gives results 
between those for and 2. The performance of 3 is similar to that for 2, while 
4's performance degrades with accurate distances. 
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7 Conclusion 



In this paper we present constNJ, a consistent distance-based algorithm for 
a collection of trees with pairwise rSPR constraints, such as those constraints 
satisfied by collections of trees which fit into a reticulation network. constNJ is 
deterministic and a strict generalization of the neighbor-joining algorithm. In 
order to ensure that the resulting set of trees satisfy the specified constraints 
on rSPR distance, we develop the theory of maximum agreement partitions, 
culminating in Theorem 2. 

We hope that this algorithm is the beginning of a new direction for phyloge- 
netic inference of reticulation networks. We simplify the problem considerably 
by assuming that the alignment blocks arc known in advance; in doing so we 
preserve the correlation between sites in the alignment with the same history. 
Rather than first finding trees and then attempting to put them into a recombi- 
nation network, we investigate the balance between discordance between trees 
and optimality of the ensemble of trees. By using trees whose rooting is derived 
from outgroups, we find explicit evolutionary histories. 

To our knowledge, constNJ is the first algorithm of its kind, and there is 
considerable room for improvement over this first attempt. First, we enforce 
pairwise bounds on the rSPR distance between trees, which is a relatively weak 
way to show that these trees fit into a network. A more explicit approach would 
be desirable. Second, distance-based methods waste a considerable amount 
of information which is used by likelihood-based methods; a logical next step 
would be to create a likelihood-based method. Doing so would require a col- 
lection of "moves" analogous to rooted nearest-neighbor-interchange or rSPR 
moves for a heuristic search, but for collections of trees, with the constraint 
that the moves don't change the rSPR distance between trees too much. One 
option would be to explicitly store a reticulate network in memory and have 
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the trees moving about inside the reticulate network while the network changes. 
Such an algorithm would do a better job of actually reconstructing a reticulate 
network. Third, an alternative direction for heuristic optimization might be to 
do a more complete search for the minimum length tree in a manner analogous 
to algorithms searching for the BME tree. Fourth, a more immediate issue 
is that constNJ does not reconstruct branch lengths. It would be possible to 
do a distance-based branch length estimation in a manner similar to that for 
usual neighbor-joining, but the fact that we are choosing trees which may be 
sub-optimal according to the NJ criterion implies that negative branch lengths 
might be encountered. Alternatively, one might use a program such as PHYML 
(?) to estimate branch lengths on each fixed topology independently. This ap- 
pears to be a reasonable way to proceed, despite the fact that the correlation 
between branch lengths of the different trees is lost. A more correct approach 
will require some sort of correlation of the branch lengths in a model-based 
manner, and we believe that such reconstruction is probably best done in the 
context of a complete likelihood-based approach as described above. Finally, it 
might be interesting to incorporate some constNJ ideas into bootscanning-type 
methods for recombination breakpoint inference. 

Eight years ago, ? wrote "[w]hen recombination occurs adjacent sites may 
have different, although correlated, genealogical histories. Reconstructing these 
genealogies with certainty is impossible." Although we do not claim certainty 
for this (or any forthcoming) algorithm attempting to reconstruct reticulate 
phylogenetic history, we think that there is cause for optimism and look forward 
to seeing future developments in this area. 
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rSPR distance 


total tree length 


tree length difference 





7.213 


0.0942 


1 


7.119 


0.0076 


2 


7.111 


0.0149 


3 


7.096 


0.0139 


9 (independent NJ) 


7.082 





Table 1: The balance between discord and optimality for the example HIV 
datasct. On the left side is the number of SPR moves required to go from the 
tree built on the G region to the tree built on the B region. In the center is the 
total tree length (see Equation 2), which is our notion of optimality. On the 
right is the difference of the total tree length between the rows. As described 
in the text, the largest drop in total tree length comes when allowing a single 
rSPR move (i.e. recombination event) between the two trees, indicating that 
one recombination event is needed to explain the data. 
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